Re: Search Engine Sizes Scrutinized

Joseph M. Reagle Jr. (reagle@mit.edu)
Tue, 13 Apr 1999 11:12:39 -0400


Having just seen Gates speak at LCS's 35th Anniversary, I naturally compare
the Gates phenomenon to the Web phenomenon, and then I read this: the whole
Web is 320 million pages? That means every Web page could represent ~$300 of
his net worth. Or he just gave LCS about $.08 for every page out there to
build a new building.

Forwarded Text ----

<Search Engine Sizes Scrutinized>
http://searchenginewatch.internet.com/sereport/9805-size.html

Search Engine Sizes Scrutinized

From The Search Engine Report
April 30, 1998

In early April, the mainstream Internet press went nuts
over a study in Science magazine that found no single
search engine indexes everything on the web.

Visitors to Search Engine Watch know this isn't a new
discovery. I've been reporting on it over the past two
years, and the site has a page devoted to the topic of
search engine size. The site's Search Engine EKGs also
illustrate both search engine size and freshness.

In fact, anyone who received HotBot's press release last
December on having the biggest index could have easily
proven that search engines don't cover everything, without
having to perform painstaking research. In it, HotBot
announced it was now indexing 110 million of the 175
million pages it estimated to exist on the web. Simple
division could tell any reporter that HotBot, the industry
leader at the time, was covering only 63% of the web.

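As a back-of-the-envelope check, the division looks like
this in Python (a sketch, not anything from the study):

    # Coverage is just index size divided by estimated web size.
    def coverage(index_size, web_size):
        return index_size / web_size

    # HotBot's December figures: 110 million pages indexed,
    # 175 million estimated to exist.
    print(f"{coverage(110e6, 175e6):.0%}")  # prints 63%
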
However, the prestige of a Science article on the topic
grabbed headlines, which is a good thing, as most search
engine users are not educated about what goes on under the
hood of their favorite service. Better education can help
them make better choices. And the painstaking research was
crucial in providing freshness ratings for each service,
along with a new estimate of the size of the web.

Does Size Matter?

Before discussing the study, it's helpful to ask: does size
matter? Yes and no. If you are looking for relatively
obscure information, it's extremely helpful to have a
service with a big index. It increases the odds that a
service will bring back a match.

In contrast, a large index is not necessarily helpful for
very general queries, which many users perform. In fact, a
smaller index of pages drawn from select sites may be more
useful.

Many of the major services made this "better not bigger"
argument throughout 1997, when it was clear they weren't
keeping up with the growth of the web. As noted, it is
valid to a point. However, there is some degree of growth
required for them to have a decent sample of what's out
there. This issue is discussed in more depth on the "How
Big Are The Search Engines" page within the site, linked to
below.

The Study

Now for some specifics from the study. Researchers at the
NEC Research Institute ran the same 575 queries on HotBot,
AltaVista, Northern Light, Excite, Infoseek and Lycos. They
then counted the matching pages, with a variety of
constraints used. Duplicate pages weren't counted, the
maximum query limits for each engine were not exceeded, and
other controls were used to normalize across services.

Another control was to discard any page from the count if
the exact search term did not appear. So if the search was
for "crystal," then a page would not be counted unless the
exact word "crystal" was found.

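In Python, that counting rule might look like the toy
sketch below; the function and the sample results are
invented for illustration, not taken from the study:

    # Count results that still contain the exact query term,
    # skipping duplicate URLs -- a toy version of the study's
    # normalization. The sample results below are made up.
    def filtered_count(results, term):
        seen, count = set(), 0
        for url, page_text in results:
            if url in seen:
                continue  # duplicate page, don't count twice
            seen.add(url)
            if term in page_text.split():
                count += 1  # exact word is present on the page
        return count

    results = [
        ("http://example.edu/a", "crystal structures in solids"),
        ("http://example.edu/a", "crystal structures in solids"),
        ("http://example.edu/b", "this page changed since indexing"),
    ]
    print(filtered_count(results, "crystal"))  # prints 1
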
The problem with this is that a page may radically change
after it is indexed. Listings may be days, weeks or even
months out of sync with the actual page. Likewise, some
sites deliver pages tailored to search engine spiders. A
human visiting the site would see a completely different
page.

This probably didn't impact the results greatly, especially
as the queries were scientific in nature, and thus tended
to retrieve pages unlikely to have been created by
webmasters swapping code. But it does point out a
difficulty in conducting this type of research, given that
the search engines, and the web itself, are not a
controlled environment.

A better solution would be to do a count of pages retrieved
from various web sites, but the problem here is that only
AltaVista, HotBot and Infoseek allow this to be done with
any degree of accuracy. This is something the Melee
Indexing survey has tried to do.

Another solution is to track the pages retrieved from known
web sites, which is what the Search Engine EKGs within
Search Engine Watch do.

After the researchers filtered the results, they had in
essence a giant pool of matching pages. They then looked at
how many pages from this pool each search engine listed.
HotBot covered the most, and its coverage was used as a
baseline for estimating search engine size.

For example, AltaVista found only 81% as many pages as
HotBot, so the researchers presumed that its index would
only be 81% the size of the HotBot index. The researchers
had HotBot's size from a recent press release, 110 million
web pages. So they multiplied 110 million by 81% to get an
estimate for AltaVista of 89 million web pages. A similar
calculation was done for the other search engines in the
study.

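That scaling step is a one-liner in Python (a sketch; the
81% figure for AltaVista is the only relative-coverage
number given in the article, so it is the only one shown):

    # Scale the baseline engine's known size by each engine's
    # coverage relative to that baseline.
    HOTBOT_SIZE = 110e6  # from HotBot's press release

    def estimate_index(relative_coverage, baseline=HOTBOT_SIZE):
        return relative_coverage * baseline

    # AltaVista matched 81% as many pool pages as HotBot:
    print(f"{estimate_index(0.81) / 1e6:.0f} million")  # 89 million

The cross-check described below, using Infoseek as the
baseline, amounts to rerunning the same function with
baseline=30e6 and coverage measured relative to Infoseek.
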
The numbers aren't too far off from those published by the
major search engines, except for Lycos.

[Chart: estimated index sizes from the study versus sizes
published by the search engines]

The study estimated Lycos to have an index of 8 million web
pages, while Lycos says its index is above 30 million. This
lower estimate would haunt the service when the overall web
coverage numbers were calculated.

(I felt that using HotBot as a baseline might skew the
results somehow, so I did the same thing using Infoseek as
the baseline, along with its published size of 30 million
web pages. The numbers remained nearly the same.)

With estimates of each search engine's size in hand, the
researchers then examined the overlap between the two
largest services, HotBot and AltaVista, to extrapolate a
size for the entire web: 320 million web pages.

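The overlap trick works like a capture-recapture estimate:
assuming the two indexes sample the web independently, the
share of one index that also appears in the other
approximates the other's coverage of the whole web. A
Python sketch follows; the 30.6 million overlap figure is
back-solved from the article's numbers, since the study's
actual overlap count isn't given here:

    # Capture-recapture: if two independently gathered indexes
    # of sizes a and b share `overlap` pages, the population is
    # roughly a * b / overlap.
    def estimate_web(a, b, overlap):
        return a * b / overlap

    # HotBot (110M) and AltaVista (est. 89M); an overlap of
    # about 30.6M pages reproduces the study's figure.
    web = estimate_web(110e6, 89e6, 30.6e6)
    print(f"{web / 1e6:.0f} million pages")  # about 320 million

    # Coverage then follows by straightforward division:
    print(f"HotBot: {110e6 / web:.0%}")  # 34%
    print(f"Lycos:  {8e6 / web:.0%}")    # about 3%
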
This size estimate was big news, because it far exceeded
most other estimates. In December, HotBot was saying the
web was at 175 million web pages (they now estimate 200
million), while several other estimates were in this range
or lower.

Finally, with a size estimate for the entire web, they
returned to calculate percentages of the web covered by
each search engine. This was straightforward division.
HotBot had the best score, 110 million out of 320 million
web pages, or 34% of the web. Lycos, estimated to have only
8 million web pages, came in last with a paltry 3%
coverage.

The Lycos Problem

As you might expect, Lycos wasn't very happy. It put forth
the argument that size isn't that important, but it also
stated that the study's estimate for it was off. It
reaffirmed to me, and others, that it has 30 million web
pages indexed, if not more.

So what happened with Lycos? Are its published numbers to
be believed? Quite possibly. If the count is indeed too
low, the most likely culprit is the queries that were
used.

The study's queries were culled from the NEC Research
Institute's staff, who are mostly scientists. Thus, these
are more likely to retrieve pages from academic resources.
If a search service does not index many pages from these
places, such as universities, then it would naturally
appear to have less coverage than a service that does.

Lycos falls right into this trap. It tends to index pages
from "popular" sites. A site with lots of links pointing at
it might get indexed in more depth, while a site that is
not well publicized may be missed entirely. As you might
expect, many university sites are not well publicized.

Were the same survey done with a different set of queries,
an entirely different picture of coverage might appear.
This is something the authors readily acknowledge within
the study. A more accurate headline for many articles might
have been "Search engines fall short for scientists," since
this study was primarily aimed at helping them search
better.

Ironically, shortly after the Science article appeared,
researchers at AltaVista's owner Digital released their own
study, one that it says did use a wide range of queries. It
would have been interesting to see if Lycos performed
better with this set, but the search engine was not
included.

The Digital study put the size of the web at 275 million
web pages in March 1998. It also found -- surprise! -- that
AltaVista provided the best coverage at 40%, with HotBot a
close second at 36%. Infoseek and Excite tied for third, at
16%.

The Issue Of Freshness

One excellent thing the NEC study did was to rate the
freshness of each search engine's index. This is very hard
to quantify without the painstaking research of physically
verifying the existence of each page listed.

Freshness is important, both because it saves people from
wasting time and because it shows that search engines
reflect the current information available on the web. Below
are the percentages of bad links found in each service:

Lycos: 1.6%
Excite: 2.0%
AltaVista: 2.5%
Infoseek: 2.6%
Northern Light: 5.0%
HotBot: 5.3%

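Verifying existence boils down to fetching every listed URL
and counting the failures. A minimal Python sketch, with
placeholder URLs (a real audit would add retries and
politeness delays):

    import urllib.error
    import urllib.request

    # Fetch each listed URL and count those that no longer
    # resolve -- a toy bad-link audit. URLs are placeholders.
    def bad_link_rate(urls):
        bad = 0
        for url in urls:
            try:
                urllib.request.urlopen(url, timeout=10)
            except (urllib.error.URLError, ValueError):
                bad += 1  # dead host, HTTP error, or bad URL
        return bad / len(urls)

    listings = ["http://example.com/", "http://example.com/gone"]
    print(f"{bad_link_rate(listings):.1%}")
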
While Lycos deserves honors for its low score, it's
AltaVista that should be most singled out. It combines one
of the largest web indices with a relatively low stale link
rate, an excellent balance.

Unfortunately, a count of dead links from Yahoo was not
included, and it could have been done. The percentage quite
possibly would have exceeded HotBot's bad score. Numerous
people complain about out-of-date sites, as well as
listings, at the service.

What To Do?

Statistics about index size and freshness aside, the most
important thing people want from a service is relevancy.
That's difficult to quantify, because relevancy is
subjective. Everyone has different expectations and
searching styles.

For this reason, most people should think of a search
service like a pair of shoes. Try different ones on, and
wear the one that fits best. If you like the results you
get, don't worry so much that another service may have a
bigger index.

Also remember that people wear different shoes for
different activities. It's the same with search services.
If you are looking for news, use a specialty news service.
If you are doing a general search, a service with a smaller
index or hand-picked listings may help.

But for the serious researcher, the coverage and freshness
numbers are extremely important. They help direct you
toward the players more suitable for in-depth searching.
This NEC study, my own studies and the search engines' own
published sizes indicate these are HotBot, AltaVista and
Northern Light. Not surprisingly, these services are among
those most named when librarians are asked what they use.

The NEC study also suggests using metacrawlers as a good
way to get the best coverage of the entire web, since no
one search engine covers everything. See the Search Engine
Watch metacrawler page for more information about these
tools.

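At heart, a metacrawler merges several engines' result
lists and removes the duplicates. A toy Python sketch, with
invented results:

    # Merge result lists from several engines, keeping the
    # first occurrence of each URL -- the deduplicating union
    # behind a metacrawler. The lists here are invented.
    def merge_results(*result_lists):
        seen, merged = set(), []
        for results in result_lists:
            for url in results:
                if url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged

    engine_a = ["http://example.edu/a", "http://example.org/b"]
    engine_b = ["http://example.org/b", "http://example.net/c"]
    print(merge_results(engine_a, engine_b))
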
More Information

Search Engine Sizes

A graphical look at how large each search engine is, with
trends over time. You will also find links to information
about the Science magazine study, an update to that study,
and a link to the similar study by Digital.

Search Engine EKGs

Provides an idea of how large and how fresh each search
engine is.

How Big Are The Search Engines

Article within Search Engine Watch that explains the issues
of index size in more depth. Does size really matter? Also
has links to other resources of information, such as the
Melee Survey.

By Danny Sullivan
Search Engine Watch
http://searchenginewatch.com/
Copyright © 1996-99 Internet.com LLC
http://www.internet.com

</Search Engine Sizes Scrutinized>

End Forwarded Text ----
_______________________
Regards, http://web.mit.edu/reagle/www/
Joseph Reagle E0 D5 B2 05 B6 12 DA 65 BE 4D E3 C1 6A 66 25 4E
independent research account