[FoRK] The grand scheme of things...

Lion Kimbro <lionkimbro at gmail.com> on Fri Nov 2 12:34:51 PDT 2007

  My code doodling lately has been in what I am affectionately calling,
  "AplPy," pronounced, "Apple Pie."

  I happen to be mixed up with some (elder) programmers, who have
  nudged me repeatedly towards tinkering with APL, and thinking more
  seriously about abbreviation.

    http://c2.com/cgi/wiki?AplLanguage


  I am now in deep heresy with my friends, none of whom take me
  seriously any more...

  ...because when I want to read out a file, I write:

  O.t("filename here")

  ...and when I want to delete a file, I call:

  O.xf("filename here")

  ...or, I should more appropriately write:

  O.xf(p)  # ...since "p" is for path.

  O.xF is the death of a folder, naturally.

  Omitting the "O" module,
  ef(p) to see if a file exists at that path,
  eF(p) to see if there's a folder at that path,
  p("home", "lion", "subdir") to construct a path with os.sep,
  ...and so on.
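
  (For concreteness, a minimal sketch of what such an "O" module
  might look like -- the names are the ones above, but the bodies
  are guesses, since the actual AplPy source isn't in this post:)

import os
import os.path
import shutil

t  = lambda p: open(p).read()        # t  -- the text of a file
xf = lambda p: os.remove(p)          # xf -- death of a file
xF = lambda p: shutil.rmtree(p)      # xF -- death of a folder
ef = lambda p: os.path.isfile(p)     # ef -- file exists at path?
eF = lambda p: os.path.isdir(p)      # eF -- folder exists at path?
p  = lambda *ss: os.sep.join(ss)     # p("home", "lion", "subdir")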

  Should I want to print out all files in the directory, it's:

  P.p("\n---\n".join(O.t(p) for p in O.ff()))

  ...which roughly translates into:

  import os, pprint
  pprint.pprint("\n---\n".join(open(p).read() for p in os.walk(".").next()[2]))

  ...but somehow seems easier, since "ff" ("file"-"file")
  is easier for me to remember than "os.walk(".").next()[2]",
  which required that I look up the output of os.walk.
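
  (Reading it back the other way, "ff" and "P.p" would be something
  like the following -- again a guess at the AplPy internals, with a
  class standing in for the "P" module:)

import os, pprint

ff = lambda: os.walk(".").next()[2]   # just the plain files, right here
class P:
    p = staticmethod(pprint.pprint)   # P.p -- pretty-print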


  One friend's reaction was:

    "Oh my God! Why are you writing Perl code in Python?"

    http://communitywiki.org/en/AgglutativeLanguage

  Yes, I understand I violate PEP-8 in my doodles here.


  Something that happened accidentally was that, in pursuing this
  heresy of abbreviation, I naturally found myself using more
  functional-programming-oriented structures.

  Lambdas abound.  Functions that return functions.

  Somehow, this seems more natural:

import BeautifulSoup

BB = BeautifulSoup.BeautifulSoup
BS = BeautifulSoup.NavigableString
BT = BeautifulSoup.Tag
BC = BeautifulSoup.Comment

eC = lambda C: lambda n: isinstance(n, C)
eBB = eC(BB)
eBS = eC(BS)
eBT = eC(BT)
eBC = eC(BC)

  Before I was working with AplPy, I almost never
  had a construct like the "eC" above.

  (I *have* studied Mertz, so this isn't totally out of the blue.
   But functional programming types do seem to go more
   in the abbreviation direction than in the Objective-C direction.)
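
  (A quick usage sketch: these predicates drop straight into
  BeautifulSoup's findAll -- the classic comment-extraction idiom:)

H = "<html><body>hi<!-- a comment --></body></html>"
n = BB(H)
nn = n.findAll(text=eBC)   # nn -- the list of Comment nodes in the tree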


  I now look at programs and entertain the position:
  "There's WAY too much verbage there.
   There's so much explanation, I can't tell what's going on!"


  TC,
    L  =^_^=



------------------------------------------------------------------

Attached here, a "coding doodle" I wrote briefly in AplPy.

It indexes pages linked from a set of pages listed in
fSS.txt, the "seed set."

----

"""Execute the Central Dogma (ZxtCD)

Index everything on a set of web pages, and everything on the pages
that they link to.

Store index works-in-progress to a dated file in a folder (ZFI).


Abbreviations / Documentation:

Understand the abbreviations, understand the system.

 Z -- the system, used to prefix constants and major systems
 z -- self, this (used in objects)

 f -- a file
 F -- a folder
 p -- a filepath

 SS -- "the Seed Set" -- a list of URLs, which are to be indexed
 fSS.txt -- name of the text file that holds the Seed Set, 1 URL / line
 pSS -- the path to fSS


The central system in the code is called, "the Central Dogma."

 ZtCD -- "The Central Dogma" -- a function that indexes
                                the seed set (SS).

 x -- "execute"
 xZtCD -- the function that executes the central dogma


When a URL is processed, it's turned first to HTML, and then into a
"Beautiful Soup" node.

 u -- a URL
 H -- the HTML content of the URL (u)
 n -- a tree node, representing the parsed HTML (H)
 T -- the text (no markup) in the HTML
 w -- an individual indexed word (lower case, no punctuation,
                                  no whitespace)

 ww -- a list of words
 uu -- a list of URLs


"Beautiful Soup" is the Python module that I'm using, to interpret the
web pages, and scan them for outgoing links.  It's notated with "B."

Beautiful Soup classes:

 BB -- BeautifulSoup.BeautifulSoup -- root node (interpreted HTML)
 BT -- BeautifulSoup.Tag -- a node in the tree, beneath BB
 BS -- BeautifulSoup.NavigableString
 BC -- BeautifulSoup.Comment

Revisit the "node" concept:

 n -- a BeautifulSoup node, either a BB, BT, BS, or BC instance.

Tests to see if a node is of a given class:

 C -- a class, in general (but not BB, BT, BS, BC)
 e -- test for equality, participation
 eBB -- eC(BB) -- "is this a BeautifulSoup.BeautifulSoup instance?"
 eBS -- eC(BS) -- "is this a BeautifulSoup.NavigableString instance?"
 eBT -- eC(BT) -- (similar)
 eBC -- eC(BC) -- (similar)


When the HTML (H) is parsed into a node, there are TWO transformations
it undergoes:

 H --> n --> T --> ww  -- index the text
       |
        `--> uu  -- & identify all outgoing links...


The outgoing links come to make up S2...

 S2 -- "the secondary set" -- the "Second Set" is made up of the URLs
                              found while processing the Seed Set (SS)


More completely, the picture of the Central Dogma (ZtCD) is:

 SS --> uu -> u -> H -> n -> T -> ww -> I -> fI
                       /                ^
                       `-> uu -> T -> ww-/
                          (S2)

The seed set (SS) produces URLs (uu) which are taken one by one (u) and
fetched, to make HTML (H) which is then turned into a BB node (n).

The node has two outputs:  raw text for indexing (T), and links to
other pages, to be processed later (uu).

The text (T) is turned into words (ww), which are then stored in the
index (I).  Each time the index is filled up, it is then dumped to a
file representing the index (fI) for consumption by other programs.

The URLs that were culled from n (uu) are put into the
"secondary set" (S2), which are later turned into text (T) and then
words (ww) and then indexed (I) and that, too, goes to the file
representing the index (fI).

-----------------------

Addendum:  "P" -- "points"

"Points" are used to score pages based on how many times the word
appears.  A word appearing three times in a page gets 3 points.  Each
appearance in the title is worth 10 points, rather than 1.

I: {w->(P,u)(P,u)...}  rather than just I: {w->uu}
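
Concretely (a sketch of the shape, with stand-in URLs u1 and u2):

  I = {"apple": [(10, u1), (1, u2)], ...}

...each indexed word maps to a list of (points, URL) pairs.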

  /--> T -> ww -> PwPw --+--> I:w->PuPu
n                       /
  \--> a -> PwPw -------

a = title text

Removed functions:

  IAww -- we no longer add words to an index, without points

New functions:

  IAPwPw -- add point-word pairs to index

  nDTDPwPw -- node to text to point-word pairs
    wwDPwPw -- turn list of words into point-word pairs
    nDaDPwPw -- node to title to scored words  (10pts per word)
    PwPwAPwPw -- add point-word pairs to point-word pairs,
                 for combining nDaDPwPw results w/ wwDPwPw results

  The form of PwPw is actually:  {w:P, ...}
  ...it should more accurately be labelled, "wP"

Summary w/ functions:

|--------- nDTDPwPw--------------|

   exists already
   | | |
   v v v
  nDTDww      wwDPwPw  PwPwAPwPw    IAPwPw

  /--> T -> ww -> PwPw --+--> I:w->PuPu
n                       /
  \--> a -> PwPw -------

        nDaDPwPw

"""

import os
import os.path
import sys
import time
import socket

import re
import urllib
import cPickle as pickle

import BeautifulSoup

BB = BeautifulSoup.BeautifulSoup
BS = BeautifulSoup.NavigableString
BT = BeautifulSoup.Tag
BC = BeautifulSoup.Comment

eC = lambda C: lambda n: isinstance(n, C)
eBB = eC(BB)
eBS = eC(BS)
eBT = eC(BT)
eBC = eC(BC)


pSS = "fSS.txt"


def uDH(u):
    """URL -> HTML text; "" if it's unreachable or not text/html."""
    try:
        f = urllib.urlopen(u)
        if "text/html" not in f.info()["content-type"]:
            return ""
        else:
            return f.read()
    except IOError:
        return ""
    except socket.error:
        return ""


# uDHDn = lambda u: BB(uDH(u))

def uDHDn(u):
    n = BB(uDH(u))
    if n.title is None:  # a page may lack <title> or <body>;
        n.title = BB()   # substitute empty soup, so the findAll
    if n.body is None:   # calls downstream don't choke on None
        n.body = BB()
    return n


nDTDww = lambda n: TDww(" ".join(n.body.findAll(text=True)))

def wwDPwPw(ww, P):
    PwPw = {}
    for w in ww:
        PwPw[w] = PwPw.setdefault(w, 0) + P
    return PwPw
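
# e.g. wwDPwPw(["cat", "cat", "dog"], 1) -> {"cat": 2, "dog": 1}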

Pa = 10  # Points for Title words
nDaDPwPw = lambda n: wwDPwPw(TDww(" ".join(n.title.findAll(text=True))), Pa)
# (tokenize the title with TDww first; feeding wwDPwPw a raw string
#  would score it character by character)

def PwPwAPwPw(first, second):
    PwPw = first.copy()
    for (w, P) in second.items():
        PwPw[w] = PwPw.setdefault(w, 0) + P
    return PwPw
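
# e.g. PwPwAPwPw({"cat": 1}, {"cat": 10, "dog": 1}) -> {"cat": 11, "dog": 1}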

Pw = 1  # Points for normal words
nDTDPwPw = lambda n: PwPwAPwPw(wwDPwPw(nDTDww(n), Pw), nDaDPwPw(n))


def TDww(T):
    ww = []
    for bT in unicode(T).lower().split():
        w = XJ(bT)
        if w != "":
            ww.append(w)
    return ww

# TDww = lambda T: [XJ(bT) for bT in unicode(T).lower().split()]
# (simpler, but it keeps "" entries for all-punctuation tokens)


ZAZ = "abcdefghijklmnopqrstuvwxyz"
XJ = lambda t: "".join([tb for tb in list(t) if tb in ZAZ])
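# e.g. XJ("don't") -> "dont"  -- keep only the letters a-z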

def nDuu(n):
    uu = set()
    for nb in n.findAll('a'):
        attributes = dict(nb.attrs)
        if attributes.get("href"):
            uu.add(attributes["href"])
    return uu

def IAPwPw(I, PwPw, u):
    """Index: Add words"""
    for (w,P) in PwPw.items():
        I.setdefault(w, []).append((P,u))


d = lambda: time.strftime("%Y%m%d", time.localtime())
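# e.g. d() -> "20071102"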


ZFI = "FI"  # Folder containing Index records

Ef = lambda p: os.path.exists(p)

def IDfI(I):
    """Index -> file Index"""
    if not Ef(ZFI):
        os.mkdir(ZFI)  # the FI folder must exist before we can dump
    pf = ZFI + os.sep + "fI" + d() + ".index"
    E = Ef(pf)  # did today's index file already exist?
    pickle.dump(I, open(pf, "w"))
    if not E:
        I.clear()  # first dump to a new dated file: restart the
                   # in-memory index afresh


def xZtCD(SS, I):
    """The Central Dogma"""
    S2 = set()
    for u in SS:
        print u
        n = uDHDn(u)
        IAPwPw(I, nDTDPwPw(n), u)
        IDfI(I)
        S2.update(nDuu(n))  # collect outgoing links for the second pass
    for u in S2 - SS:
        print u
        IAPwPw(I, nDTDPwPw(uDHDn(u)), u)
        IDfI(I)


if __name__ == "__main__":
    I = {}
    SS = set(open(pSS, "r").read().splitlines())
    xZtCD(SS, I)
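
(To run this doodle, put one URL per line in fSS.txt beside the
script -- for instance http://c2.com/cgi/wiki -- and the dated
index pickles land in the FI folder.)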
