[FoRK] Software through the lens of evolutionary biology

Eugen Leitl eugen at leitl.org
Sun Oct 20 01:59:37 PDT 2013


Software through the lens of evolutionary biology


My preferred job title is ‘theorist’, but that is often too ambiguous in
casual and non-academic conversation, so I often settle for ‘computer
scientist’. Unfortunately, it seems that the overwhelming majority of people
equate computer scientists to programmers or some general ‘tech person’,
forgetting M.R. Fellows' rallying cry: “Computer science is not about
machines, in the same way that astronomy is not about telescopes.” Although —
like most theorists — I know how to program, the programming I do is nothing
like what (I hear) is in industry. In particular, all of my code is
relatively small and with concentration, or maybe a single sheet of paper, I
can usually keep the whole thing in my head. In fact, the only time I’ve
worked in a large code base was developing extensions for MediaWiki during my
first summer of college to be used by some groups at the Canadian Light
Source. Combined with the preceding semester of drawing UML diagrams and
writing up req&spec documents, I was convinced that I would never be a
software engineer. However, I did learn some valuable lessons: real-world
projects are big and unwieldy, logistics have to be taken seriously, comments
and documentation are your friends, and for a sufficiently large software
project there is no single person that knows the whole code.

With that much unknown, it is not surprising that bugs abound.
Even back in 2002 software bugs cost the US $59.5 billion annually or 0.6% of
the GDP, and I imagine the cost has only gone up. If you count ultrafast
extreme events or flash crashes of automated high-frequency traders as bugs,
then some argue that part of our recent financial crises can be blamed on
software errors (Johnson et al., 2013). To get a feel for the numbers, a
big project like Mozilla Firefox can easily get 2000 new bugs in a year (see
figure at left). Yet most of these bugs are not particularly difficult,
and don’t require major overhauls to fix. If even the most serious failures
can be fixed by a 12-year-old, why not let evolution have a go?

Usually on TheEGG, I advocate using the algorithmic lens to view fields like
biology and ecology. Stephanie Forrest, of the University of New Mexico and
the Santa Fe Institute, suggests the dual: let’s use ideas from biology and
ecology to come to grips with computer science, or at least software
engineering. In particular, she and her co-authors have developed GenProg — a
system for evolutionary program repair (Weimer et al., 2009; Le Goues et al.,
2012a,b). She described her work at the recent Stanislaw Ulam Memorial
Lecture Series (attention conservation note: this is an hour-long lecture;
watch it if you find this topic particularly fascinating. However, I will
make this post largely self-contained, so you can read on without watching
the whole video):


Overall, genetic programming has a pretty bad rap: lots of promises, few
successes, and little theory inherited from more general genetic algorithms.
Part of the disappointment has stemmed from a focus on the programming
equivalent of abiogenesis (let’s call it aprogrammagenesis): starting with
an empty program and hoping to evolve one that can solve some task that we
don’t quite know how to approach — obviously, since programmers are still
employed, wide successes have not been had. This is where we can take our
first idea from biology. A key insight of Orr-Gillespie theory (Gillespie,
1991; Orr, 2002) is that the wild type is already highly adapted for its
environment. In terms of programs: start with what programmers wrote, which
already performs decently, and look at variations on this ‘wild-type’
program. This is Forrest and colleagues’ starting point for GenProg.

Le Goues et al. (2012a) define fitness in terms of a set of positive
(functionality that should not be lost) and negative (faults to be repaired)
test-cases. The more test-cases you satisfy, the higher your fitness. A
program is represented as an abstract syntax tree and a weighted path. The
tree contains the program automatically decomposed into a machine-friendly
format, and the weighted path associates with each statement in the tree a
weight based on how often that statement occurs in trace runs of the test
cases. In practice, there can be thousands of test cases, or we could even
use random testing and thus produce something akin to Valiant’s (2009) model
of evolvability — although supplying a membership oracle is a little more
subtle in practice than in theory. But for simplicity, the authors considered
only two to six positive test cases and a single negative test case (usually
based on vulnerability reports) that is weighted as 10 times more important.
This results in only four to twelve possible fitness values, and a holey
adaptive landscape — most mutations leave fitness unchanged, and some greatly
reduce fitness.
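This weighted test-case fitness can be sketched in a few lines of Python. This is a toy reconstruction under my own assumptions: the test predicates and names here are hypothetical, and GenProg proper compiles and runs real C programs against real test suites.

```python
# Toy sketch of GenProg-style fitness evaluation. A candidate program's
# fitness is the weighted number of test cases it passes; negative tests
# (the faults to be repaired) count 10 times as much as positive ones.

def fitness(variant, positive_tests, negative_tests, neg_weight=10):
    """`variant` is a callable standing in for a compiled program;
    each test is a predicate returning True when the variant passes."""
    score = sum(1 for t in positive_tests if t(variant))
    score += sum(neg_weight for t in negative_tests if t(variant))
    return score

# Two positive tests (behaviour to keep) and one negative test (the bug).
pos = [lambda p: p(1) == 2, lambda p: p(0) == 1]
neg = [lambda p: p(-2) == 0]

buggy = lambda x: x + 1          # passes the positives, fails the negative
fixed = lambda x: max(x + 1, 0)  # passes all three tests

print(fitness(buggy, pos, neg))  # 2
print(fitness(fixed, pos, neg))  # 12
```

With so few tests there are only a handful of distinct fitness values, which is exactly why the landscape is ‘holey’: most mutants land on the same fitness plateau as their parent.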

To have any hope of finding the bug, Le Goues et al. (2012a,b) depart from
biology and bias mutations to make changes to lines of high weight (visited
by many negative executions and few positive ones) more likely. They also
only consider line-level changes of either a deletion, an insertion, or a
swap/replace. For insertions and swaps, they pick a line uniformly at random
from the program to place after (for insertion) or replace (for swap) the
mutating line. This mutation mechanism really exploits Orr-Gillespie theory,
because it assumes that the high fitness program already has the right ideas
somewhere in it to fix the bugs, and they were just omitted by the programmer
in the case of the erring path. However, this also means that the process
cannot innovate new types of solutions, and so is not directly useful for
AI’s dream of aprogrammagenesis. Finally, since Forrest was a PhD student of
John Holland, her genetic algorithms have to be sexual and incorporate
crossover. Each offspring experiences crossover in the form of taking two
execution paths that are used by at least one test case, picking a random
point along the paths and swapping all statements after that cut point.
Again, this is a cross-over that is biased in a biologically unreasonable way
to exploit some domain knowledge about how typical software bugs crop up.
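A toy version of the biased mutation operator might look like the sketch below. All names are hypothetical, and a program is simplified to a list of statement strings where GenProg proper mutates an abstract syntax tree.

```python
import random

# Toy sketch of GenProg-style mutation at statement granularity
# (simplified reconstruction; names are hypothetical).
def mutate(program, weights, rng=random):
    """Apply one mutation, biased toward high-weight statements.

    `weights[i]` is high for statements visited mostly by failing
    (negative) test runs, so likely-buggy lines get mutated more often."""
    program = list(program)
    # Pick the statement to mutate in proportion to its weight.
    target = rng.choices(range(len(program)), weights=weights, k=1)[0]
    op = rng.choice(["delete", "insert", "swap"])
    # Insertions and swaps reuse an existing statement from the program
    # itself: the fix is assumed to already exist somewhere in the code.
    donor = program[rng.randrange(len(program))]
    if op == "delete":
        del program[target]
    elif op == "insert":
        program.insert(target + 1, donor)
    else:  # swap/replace
        program[target] = donor
    return program

prog = ["x = read_input()", "y = x + 1", "print(y)"]
w = [0.1, 0.8, 0.1]  # the middle statement dominates failing runs
print(mutate(prog, w, random.Random(0)))
```

Note that every new statement comes from the program itself, which is the Orr-Gillespie bet in code form: the raw material for the repair is already present, just misplaced.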

Due to the nature of the neutral landscape, mutations, and recombination,
when the genetic algorithm terminates, the resulting fittest mutants usually
contain orders of magnitude more changes than necessary to fix the bug. The
authors minimize this code by looking at a structured diff with the original
program, and checking each new line to see if it changes behavior on some
test case. If the line does not affect a test case then it is removed,
effectively undoing the neutral meandering that the evolution underwent. This
trimmed solution is then presented to a programmer for evaluation. Le Goues
et al. (2012b) tested this evolutionary paradigm by finding candidate
programs that had at least 50,000 lines of C, 10 viable test cases, and 300
versions in a revision control system; these programs were: fbc, gmp, gzip,
libtiff, lighttpd, php, python, wireshark. They ran their evolutionary
algorithms on Amazon’s cloud servers for up to 12 hours per bug ($8.80 max
spent on computing time), and ended up fixing 55/105 bugs at an average cost
of $7.32.
I imagine that one would be hard-pressed to find a programmer willing to work
that cheap.
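The minimization step, undoing every edit that no test case notices, can be sketched as a greedy loop. This is my simplified reconstruction with hypothetical names; the real system works over a structured diff of abstract syntax trees.

```python
# Toy sketch of post-evolution minimization: each edit in the diff
# against the original program is tentatively undone; if the variant
# still passes all tests without it, the edit was neutral drift and
# is dropped.

def minimize(edits, passes_all_tests):
    """Greedily drop edits that don't affect test outcomes.

    `edits` lists the changes distinguishing the evolved program from
    the original; `passes_all_tests(subset)` stands in for rebuilding
    the program from a subset of edits and running the full test suite."""
    kept = list(edits)
    for e in list(kept):
        trial = [k for k in kept if k != e]
        if passes_all_tests(trial):
            kept = trial  # edit e was neutral; discard it
    return kept

# Toy example: only edit "B" is actually needed to pass the tests.
print(minimize(["A", "B", "C"], lambda subset: "B" in subset))  # ['B']
```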

The important question for me is not if this works, but why it works. All the
results in the previous paragraph involved populations of 40 programs run
for just 10 generations — an evolutionary sprint comparable in speed to the
affinity maturation response of the human immune system. Given how hard
finding a local
fitness peak is for general fitness landscapes (Kaznatcheev, 2013), what are
the special features of the software fitness landscape that make it so
friendly to adaptation? Schulte et al. (2013) answer this by specializing the
arguments of evolutionary economics to software:

Today’s software arose through fifty years of continued use, appropriation,
and refinement by software developers. The tools, design patterns and codes
that we have today are those that have proven useful and were robust to
software developer’s edits, hacks, and accidents, and those that survived the
economic pressures of the marketplace. We hypothesize that these evolutionary
pressures have caused software to acquire mutational robustness resembling
that of natural systems.

In particular, the authors show that in the associated fitness landscape
created by their mutation operators and the test suites, an average of 36.8%
and a minimum of 21.2% (across 22 real-world programs involving 150,000 lines
of code and 23,151 tests) of mutations are neutral. Since the hardness
results in Kaznatcheev (2013) rely on non-neutral landscapes, neutrality
could be an answer to the speed and effectiveness of evolutionary bug-fixing,
but I don’t think it is the whole story. I think it is worthwhile to look
more than one mutation out, and look at the prevalence of epistasis in the
software fitness landscape. I think this is the greatest contribution
software engineering can make to biology. In evolutionary biology, the
notorious difficulty of getting good data on the structure of empirical
fitness landscapes has led to a disconnect between theory and experiment
(Orr, 2005; Kryazhimskiy et al., 2009). Why not test our biologists’
theories against the landscapes observed in the digital ecosystems created
by humans? Ecologists are already trying this by applying ecological
theories to the detailed data of the financial ecosystem (May et al., 2008),
so why not do the same for evolutionary biology and use software engineering
as an artificial laboratory for studying the structure of empirical fitness
landscapes?
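The mutational-robustness measurement behind those 36.8% and 21.2% figures can be sketched on a toy genome: sample single-step mutants and count the fraction whose test-suite behaviour is unchanged. All names here are hypothetical stand-ins for the real programs and test suites of Schulte et al. (2013).

```python
import random

# Toy sketch of a mutational-robustness measurement: estimate the
# fraction of single-step mutants whose fitness (test-suite behaviour)
# is unchanged from the original program.

def neutrality(program, mutate, fitness, samples, rng):
    baseline = fitness(program)
    hits = sum(1 for _ in range(samples)
               if fitness(mutate(program, rng)) == baseline)
    return hits / samples

# Toy genome: ten positions, but only position 0 is under selection.
fit = lambda g: 1 if g[0] == "a" else 0

def point_mutation(g, rng):
    g = list(g)
    g[rng.randrange(len(g))] = rng.choice("abc")
    return g

# Most point mutations hit positions no "test" cares about, so the
# measured neutrality is high (roughly 0.93 for this toy setup).
print(neutrality(list("a" * 10), point_mutation, fit, 2000, random.Random(1)))
```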


Gillespie, J.H. (1991). The causes of molecular evolution. Oxford University
Press.

Johnson, N., Zhao, G., Hunsader, E., Qi, H., Johnson, N., Meng, J., & Tivnan,
B. (2013). Abrupt rise of new machine ecology beyond human response time.
Scientific Reports, 3.

Kaznatcheev, A (2013). Complexity of evolutionary equilibria in static
fitness landscapes. arXiv: 1308.5094v1.

Kryazhimskiy, S., Tkacik, G., & Plotkin, J.B. (2009). The dynamics of
adaptation on correlated fitness landscapes. Proc. Natl. Acad. Sci. USA
106(44): 18638-18643.

Le Goues, C., Nguyen, T., Forrest, S., & Weimer, W. (2012a). GenProg: A
generic method for automatic software repair. Software Engineering, IEEE
Transactions on, 38(1), 54-72.

Le Goues, C., Dewey-Vogt, M., Forrest, S., & Weimer, W. (2012b). A systematic
study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In
Software Engineering (ICSE), 2012 34th International Conference on (pp.
3-13). IEEE.

May R.M., Levin S.A., & Sugihara G. (2008). Ecology for bankers. Nature,
451(7181): 893-5.

Orr, H.A. (2002). The population genetics of adaptation: the adaptation of
DNA sequences. Evolution 56: 1317-1330.

Orr H.A. (2005). The genetic theory of adaptation: a brief history. Nature
Reviews Genetics, 6(2).

Schulte, E., Fry, Z. P., Fast, E., Weimer, W., & Forrest, S. (2013). Software
Mutational Robustness. Journ. Genetic Programming and Evolvable Machines.
DOI: 10.1007/s10710-013-9195-8

Valiant, L.G. (2009) Evolvability. Journal of the ACM 56(1): 3.

Weimer, W., Nguyen, T., Le Goues, C., & Forrest, S. (2009). Automatically
finding patches using genetic programming. In Proceedings of the 31st
International Conference on Software Engineering (pp. 364-374). IEEE Computer
Society.
