FoRK and Spam..

Dan Brickley danbri@w3.org
Mon, 18 Mar 2002 14:27:09 -0500 (EST)


(+cc: gerald)

On Mon, 18 Mar 2002, Ian Andrew Bell wrote:

> My suggestion is that Rohit might want to review his openness policy or that
> JoeBar and the guys maintaining XENT.com might want to actually decide on a
> schema for neutralizing FoRK's incredible value to email address harvesters.
[...]
> Anyway, don't you think it's time to fix the problem?

Coincidentally enough, I finally got around to moving to white-list based
filtering this weekend. I've a list of 'known senders' harvested from
various places (my sent-mail, addressbooks etc).

I started out following Gerald's recipe:

	http://impressive.net/people/gerald/2000/12/spam-filtering.html

...and got to thinking about the possibility of white-list sharing, since
my 'unknown senders' folder was, initially at least, still getting lots of
false hits (mostly from people on mailing lists, but also from
occasional correspondents who are known in the Webby community, but not
in my sent-mail or addressbook).

I think Gerald and others mostly don't try to whitelist for mailing lists,
and just pipe mailing lists into separate folders. I was wondering whether,
if one adopted conventions for scrambling mailboxes (sha1/md5 or
whatever), it'd be possible to harvest lists of 'known senders' from the
list management tools of FoRK etc.

For example, my whitelist exposed as RDF looks like:

	http://tux.w3.org/~danbri/rdfweb/foafwhite.xml

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	  xmlns:foaf="http://xmlns.com/foaf/0.1/">
	<foaf:NonSpamMailboxURI  foaf:sha1Value="721ae0b3232bf1ce6486d952fa6629ff31e6edf6"/>
	<foaf:NonSpamMailboxURI  foaf:sha1Value="fb7efcdeb2e9ea622c8afd337299cd3b58cd35ec"/>
	<foaf:NonSpamMailboxURI  foaf:sha1Value="57f3d787a51bb9413506ad005de3f7e0e9602a17"/>
<!-- .... -->
</rdf:RDF>

You can harvest this, and use it (by downcasing and sha1-ing) to see if a
mailbox is known to me and believed to be a non-spammer. But it doesn't
readily expose my contacts list, and since it carries no semantics other
than 'mailboxes dan has heard of that he doesn't believe are used by
spammers', I'm not compromising my privacy or that of my various
correspondents.
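
A minimal sketch of that check, in Python rather than the Ruby of the
footnoted script. The post only says "downcasing and sha1-ing", so the exact
string that gets hashed is an assumption here: this sketch hashes the
lowercased mailto: URI, and the function names are made up for illustration.

```python
import hashlib

def mailbox_digest(address, prefix="mailto:"):
    # ASSUMPTION: the hashed string is the lowercased "mailto:" URI.
    # Pass prefix="" if the publisher hashes the bare address instead.
    uri = (prefix + address.strip()).lower()
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()

def is_known_sender(address, whitelist):
    # whitelist is a set of lowercase hex SHA-1 digests.
    return mailbox_digest(address) in whitelist
```

Because only digests are published, a reader can test an address they already
hold but cannot list the addresses behind the digests.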


So I was thinking I'd have a little whitelist harvesting script(*) pull in
a few of these each day from friends and colleagues, making it that bit
less likely that folk from (mumble) "the web community" would find their
messages languishing in my unknown-senders folder.
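
The harvesting step might look roughly like the following. This is a sketch
only: it reads the serialization shown above with a plain XML parser, not a
real RDF parser, so it would miss the same data written in another RDF syntax.
The function names are hypothetical, and fetching each document (for example
with urllib) is left to the caller.

```python
import xml.etree.ElementTree as ET

FOAF = "http://xmlns.com/foaf/0.1/"

def digests_from_rdf(xml_text):
    # Collect every foaf:sha1Value found on a foaf:NonSpamMailboxURI element.
    root = ET.fromstring(xml_text)
    found = set()
    for node in root.iter("{%s}NonSpamMailboxURI" % FOAF):
        value = node.get("{%s}sha1Value" % FOAF)
        if value:
            found.add(value.strip().lower())
    return found

def merge_whitelists(documents):
    # documents: iterable of RDF/XML strings, one per friend or colleague.
    merged = set()
    for text in documents:
        merged |= digests_from_rdf(text)
    return merged
```

Merging is a plain set union, so a digest published by several people is
stored once and the order of harvesting does not matter.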

Current headache is (as mentioned here) mailing lists. I currently deliver
email lists to folders but also keep a copy in my inbox (since I read fast),
so I'd rather not bypass the whitelist filter for list traffic. But some
lists (eg FoRK) are open to non-subscribed posters, and hence to spam.

I reckon if we had a list of mangled mailboxes detailing the legitimate
members of FoRK, I'd get less spam and false 'unknown sender' hits.
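
As a rough illustration of how such a member list could be applied to list
traffic: the sketch below hashes the From address of a message and looks it
up in a set of member digests. The folder names, the function names and the
mailto: prefix are all assumptions, and since From headers are easily forged
this only filters out senders who are not posing as members.

```python
import hashlib
from email import message_from_string
from email.utils import parseaddr

def mailbox_digest(address, prefix="mailto:"):
    # ASSUMPTION: digests are SHA-1 of the lowercased "mailto:" URI.
    uri = (prefix + address.strip()).lower()
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()

def classify(raw_message, member_digests):
    # Return a (hypothetical) folder name for one raw RFC 822 message.
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    if address and mailbox_digest(address) in member_digests:
        return "inbox"
    return "unknown-senders"
```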

How's that sound? Anybody fancy trying this?

Dan

(*) rough-cut Ruby code that implements much of this (requires external
RDF parser) is at http://www.w3.org/2001/12/rubyrdf/util/foafwhite/foafwhite.rb