FW: Flawed AltaVista Internet Search Engine

Dan Kohn (dan@teledesic.com)
Thu, 27 Mar 1997 10:22:40 -0800

This is hard to believe.

- dan

-----Original Message-----
From: Phil Agre [SMTP:pagre@weber.ucsd.edu]
Sent: Wednesday, March 26, 1997 6:05 AM
To: rre@weber.ucsd.edu
Subject: Flawed AltaVista Internet Search Engine

[It seems to me that the whole keyword-based search engine paradigm on
the Web collapsed back in the fall sometime. At least that's when I
stopped being able to find anything on the Web using Lycos, Alta
Vista, etc. unless I had an obviously unique set of words to search on,
if then. Now that the Web has outgrown indexing and search methods
that librarians rejected decades ago, maybe the time will come to get
some serious ideas about the subject. We may even have to listen to
the librarians' opinions. Now, some people are out there trying to
catalog the Web using library cataloging principles. But (as the
librarians well know) that doesn't work because URLs are too
impermanent; I've given up trying to cooperate with people who think
they're cataloging Web-based periodicals such as The Network Observer.
We need some different metaphors for cataloging and for the Web.
Once we get over this IPO-driven mania about "push" technology, maybe
we can get back to business and rethink what it means to order
information in a totally decentralized environment.]

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
This message was forwarded through the Red Rock Eater News Service.
Send any replies to the original author, listed in the From: field.
You are welcome to send the message along to others but please do not
use the "redirect" command. For information on RRE, including
instructions for (un)subscribing, send an empty message to
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Date: Wed, 26 Mar 1997 08:20:28 -0500
From: John Pike <johnpike@fas.org>
To: pagre@weber.ucsd.edu
Subject: Flawed AltaVista Internet Search Engine

"As web-surfing enthusiasts already know, AltaVista is a program
that will search the entire Web..." was the way Amy Schwartz
introduced a review of the new book "The AltaVista Search
Revolution" on the oped page of the Washington Post ["The
Information Laundromat" 22 March 1997].
http://discuss.washingtonpost.com/wp-srv/WPlate/1997-03/22/015L-032297-

While AltaVista is indeed an estimable implementation, most
web.surfers will be astonished to learn that, contrary to this
conventional wisdom, AltaVista indexes only a small, flawed, arbitrary
and not even random sample of what is on the web today.

Estimates of the total content of the web are of necessity
speculative, but run as high as 150 million pages. AltaVista claims
< http://altavista.digital.com/ > to be "the largest Web index: 31
million pages found on 476,000 servers." So where are the missing
pages?? [or as Ronald Reagan asked, "where is the rest of me??"].
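
A rough back-of-the-envelope sketch of the coverage those two figures imply, in Python. The function name is made up for illustration, and the 150 million total is only the high-end estimate quoted above, not a measurement.

```python
def coverage(indexed_pages: int, estimated_total: int) -> float:
    """Return the fraction of an estimated total that is actually indexed."""
    if estimated_total <= 0:
        raise ValueError("estimated_total must be positive")
    return indexed_pages / estimated_total


# Figures quoted in the text: AltaVista's own claim vs. a high-end web estimate.
ALTAVISTA_CLAIMED_PAGES = 31_000_000
HIGH_END_WEB_ESTIMATE = 150_000_000  # speculative upper estimate

share = coverage(ALTAVISTA_CLAIMED_PAGES, HIGH_END_WEB_ESTIMATE)
print(f"Indexed share of the web: {share:.1%}")
print(f"Pages unaccounted for: {HIGH_END_WEB_ESTIMATE - ALTAVISTA_CLAIMED_PAGES:,}")
```

With these inputs the share works out to roughly one fifth, leaving about 119 million pages unaccounted for if the high-end estimate is right.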

There are many reasons a web page might not show up in the AltaVista
index. Some parts of some sites are kept out of search indexes with the
Robots Exclusion Protocol, which asks search engines not to index
certain pages. Other types of content, such as the Adobe Portable
Document Format [PDF], do not currently support indexing. Some large
sites dynamically generate their content, rendering it invisible to
search engines. And other sites have security access controls which
may [or may not!!! but that is another story....] preclude indexing
their pages.

But surely this does not explain why the estimable AltaVista indexes
only 20% of the web.
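
For the Robots Exclusion Protocol mentioned above, here is a minimal sketch of how a well-behaved crawler might decide whether a page is off limits, using Python's standard `urllib.robotparser`. The robots.txt contents, the example.org URLs and the `is_indexable` helper are all hypothetical; "Scooter" is the crawler name used in the AltaVista FAQ, and nothing here claims to show how Scooter itself works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might publish to keep one directory unindexed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_indexable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.org/spp/index.html",
    "http://www.example.org/private/draft.html",
):
    verdict = "may index" if is_indexable(ROBOTS_TXT, "Scooter", url) else "excluded"
    print(f"{url}: {verdict}")
```

Note that the protocol is purely advisory: it only excludes pages from crawlers that choose to honor it, and it cannot account for pages a crawler is permitted to fetch but never indexes.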
The AltaVista FAQ sez:
>How do I submit my site to AltaVista?
>Use our Add URL feature, found at the bottom of every
>page. Simply type in the main URL for your site. You can
>submit several URLs, but it is considered bad taste to
>manually submit your entire site: just let Scooter do this for you.

This certainly creates the impression that once AltaVista has even
one URL from a site, it will automatically [in the fullness of time,
but that is another story as well....] include the entire site in
its widely used index. Certainly, this claim is the reason that
AltaVista is so widely relied upon, and the reason that most
web.users assume that "if it ain't in AltaVista, it ain't online."

I webmaster the Federation of American Scientists site,
which is a medium-sized website with some 6,000 pages and about ½ Gig
online. Recently I noticed that the Alta Vista search engine seemed to
index only about 600 of our pages. I thought that this was rather odd,
since I had long had the impression that AltaVista indexed pretty much
everything, or at least made a good-faith best effort to do so. I
asked them about this, and this is what I got back:
>Date: Tue, 18 Mar 1997 09:08:39 -0800 (PST)
>From: Alta Vista Support
>To: johnpike
>Subject: Re: AltaVista not indexing www.fas.org
>That is probably a good estimate...We have 600 pages from you indexed
>in the system. You will probably not see much more than that for any
>domain. Geocities has 300...and they have 300,000 members.

I confess that I was rather horrified as I contemplated the
implications of this [which can be verified by searching AltaVista on
< host:geocities.com > ... try this trick on your own domain and see
what happens!!!].

For a medium to large site, such as ours, it means that they are only
indexing some arbitrarily selected subset of our total content. Thus
corporations, universities, or most other really content-rich sites
will be poorly represented in their index.

It also means that for smaller entities that do not have their own
domain, their content will also not be indexed. As in, are the
reported 300,000 users of Geocities aware that the fact that their
pages are hosted @ www.geocities.com [or the larger number of folks
who are hosted @ members.aol.com] means that they are effectively
invisible to AltaVista, one of the most widely used and admired search
engines?

What this seems to mean is that medium-sized sites of a few hundred
pages are going to show up nicely in AltaVista, but larger and smaller
implementations will be nearly invisible, which is a rather odd way of
doing things. I mean, this is sorta like buying a map that shows some
arbitrary number of roads but doesn't have any of the main
interstates, or a phone book that only has even-numbered phone
numbers, or something.
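
To put rough numbers on that, here is a minimal sketch in Python that assumes the index keeps at most a fixed number of pages per domain. The 600-page cap is inferred from the support reply quoted above and is not a documented AltaVista parameter; the function name is made up, and the 300,000-page case assumes, purely for illustration, one page per member on a shared host.

```python
def indexed_fraction(domain_pages: int, per_domain_cap: int = 600) -> float:
    """Fraction of a domain's pages that fit under an assumed per-domain cap."""
    if domain_pages <= 0:
        raise ValueError("domain_pages must be positive")
    return min(domain_pages, per_domain_cap) / domain_pages


# A site of a few hundred pages, a 6,000-page site, and a shared host
# carrying 300,000 pages under one domain (illustrative assumption).
for pages in (300, 6_000, 300_000):
    print(f"{pages:>7,} pages under one domain: {indexed_fraction(pages):.1%} indexed")
```

Under this assumption a 300-page site is fully covered, a 6,000-page site gets 10% coverage, and a domain shared by 300,000 pages gets about 0.2%, which is why individual pages on a shared host would be nearly invisible.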

I confess that I was not previously aware of this practice of
AltaVista, which has certainly not been previously reported anywhere,
and is certainly @ variance with their apparent claims that if you
supply them with one URL from your site they will spontaneously
include the rest of your site in their index.

This is not to trash AltaVista, which at least has an implementation
that enables you to determine just how many of your pages are in their
index [I can't seem to make the other engines do this neat trick]. But
it is to say that anyone whose online presence has been predicated on
their entire site [large or small] showing up in AltaVista had better
think again. And that anyone trying to search the 'entire' web [as
opposed to some arbitrary sample thereof] had best look somewhere
other than AltaVista.

Frankly, I think this is a more significant story than the widely
reported "flawed Pentium chip" or "browser security flaws" stories.
These highly visible episodes affected only a small number of users,
or were more in the nature of theoretical problems. But AltaVista
claims to be used nearly 30 million times a day, so this "undocumented
feature" of AltaVista affects nearly everyone who uses the web
[doesn't everyone???].

As someone who uses AltaVista many times a day, and whose webpresence
strategy had been predicated on "If I build it, they will come, cause
they will find it in AltaVista," this has really come as a shock to me,
and I imagine that it would come as a shock to many others as well. I
mean, it is one thing to admit that regenerating a web.wide index
takes a long time, and that your index goes stale after a month or so,
but it is another to admit that you are just not even trying to index
large sites, or small sites that are appended to an ISP's domain, and
I am pretty astounded.

To keep track of this issue, Melee's Indexing Coverage Analysis (MICA)
examines the relative page coverage for a select group of search
engines. Each week, Melee Productions will retest the engines on the
list and publish an update to the MICA Report. They will be happy to
test any publicly accessible search engine that supports date-range
and host/domain constraints, and purports to index at least one fifth
of the "web".
Stay tooned for further developments!!!


John Pike
Director, Space Policy Project
Federation of American Scientists
307 Massachusetts Ave. NE
Washington, DC 20002
V 202-675-1023, F 202-675-1024, http://www.fas.org/spp/