Web characterization

Jim Whitehead (ejw@ICS.uci.edu)
Thu, 17 Sep 1998 13:03:42 -0700


I've recently been doing some research on exactly how the Web generates
network effects, and in the process I finally read the following excellent
paper:

"Summary of Web Characterization", James E. Pitkow, In Proc. WWW7, pages
551-558.

http://decweb.ethz.ch/WWW7/1877/com1877.htm

This is a survey of existing research on Web characterization which collects
8 invariant characteristics of Web behavior.

Quoted directly from table 1 of the paper, these invariants are:

Invariant
Sources
Metric

Requested file popularity
[Glassman 1994] [Cunha et al 1995] [Almeida et al 1996]
Zipf Distribution

File sizes (requested and from entire Web)
[Cunha et al 1995][Bray 1996][Woodruff et al 1996] [Arlitt and
Williamson 1996]
Heavy tailed (Pareto) with average HTML size of 4–6 KB and
median of 2 KB, images have an average size of 14 KB

Traffic properties
[Sedayao 1994][Cunha et al 1995][Arlitt and Williamson 1996]
Small images account for the majority of the traffic and
document size is inversely related to request frequency

Self-similarity of HTTP traffic
[Crovella and Bestavros 1995] [Gribble and Brewer 1997]
Bursty, self similar traffic between the micro second and
minute time range

Periodic nature of HTTP traffic
[Bolot and Hoschka 1996][Abdulla et al 1997a] [Gribble and
Brewer 1997]
Periodic traffic patterns able to be modeled by time series
analysis at the hour to weekly time range

Site popularity
[Arlitt and Williamson 1996] [Abdulla et al 1997b]
Roughly 25% of the servers account for over 85% of the traffic

Life span of documents
[Worrell 1994][Gwertzman and Seltzer 1996]
Around 50 days, with HTML files being modified and deleted more
frequently than images and other media

Occurrence rate of broken links while surfing
[WCG 1997-Xerox PARC, Virginia Tech]
Between 5–8% of all requested files

Occurrence rate of redirects
[WCG 1997-Xerox PARC, Virginia Tech]
Between 13–19% of all requested files

Number of page requests per site
[Huberman et al 1997][Catledge and Pitkow 1995][Cunha et al
1995]
Heavy tailed (Inverse Gaussian) distribution with typical mean
of 3, standard deviation of 9, and mode of 1 page request per site

Reading time per page
[Catledge and Pitkow 1995][Cunha et al 1995]
Heavy tailed distribution with an average 30 seconds, median of
7 seconds, and standard deviation of 100 seconds

Session time outs
[Catledge and Pitkow 1995][Cunha et al 1995]
25 minutes, with mean time of 9 minutes
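
As a rough illustration of two of the metrics in the table (not taken from the paper), the Python sketch below shows how a Zipf popularity law concentrates requests on a few files, and why a Pareto heavy tail puts the mean well above the median. The function names, the Zipf exponent of 1, the file count of 10,000, and the Pareto shape of 1.3 are all assumptions picked for illustration; only the 2 KB median and the 4–6 KB average come from the table.

```python
import math


def zipf_top_share(n_files: int, top_k: int, exponent: float = 1.0) -> float:
    """Fraction of requests that go to the top_k most popular of n_files,
    when the request probability of rank r is proportional to 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_files + 1)]
    return sum(weights[:top_k]) / sum(weights)


def pareto_mean(shape: float, scale: float) -> float:
    """Mean of a Pareto distribution; infinite when shape <= 1."""
    if shape <= 1.0:
        return math.inf
    return shape * scale / (shape - 1.0)


def pareto_median(shape: float, scale: float) -> float:
    """Median of a Pareto distribution with the given shape and scale."""
    return scale * 2.0 ** (1.0 / shape)


def pareto_scale_for_median(shape: float, median: float) -> float:
    """Scale (minimum value) that yields the requested median."""
    return median / 2.0 ** (1.0 / shape)


if __name__ == "__main__":
    # Assumed: 10,000 files, Zipf exponent 1 (illustrative values only).
    share = zipf_top_share(n_files=10_000, top_k=100)
    print(f"Top 1% of files receive {share:.0%} of requests")

    # Assumed shape of 1.3; the 2 KB median is the figure from the table.
    shape = 1.3
    scale = pareto_scale_for_median(shape, median=2.0)
    print(f"median = {pareto_median(shape, scale):.1f} KB, "
          f"mean = {pareto_mean(shape, scale):.1f} KB")
```

With these assumed parameters the mean lands inside the 4–6 KB range quoted for HTML files while the median stays at 2 KB, which is the signature of a heavy tail: a small number of very large files pull the average up.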

- Jim
