25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...
Home   Projects   Up   Sitemap   Search   Blog   Forum+Chat   About Us   Privacy   Terms of Use   Feedback   FAQ   Images   Services   Ads   Donate   Humor

Up 1-Learning 2-Patterns 3-Real Life 5-Classifiers 6-Biology 7-Neural Nets 8-Connectionism 9-ALife 10-AI

Essays on Complex Systems Part 4 - Genetics

Real Genetics (Biology)

How Species Diverge

If the dividing line between species is whether they can successfully reproduce together, then how does the split occur?

It can't be a sudden, deciding, mutation in one individual, because it couldn't breed with others to pass it on. And you can't expect it to occur simultaneously in more than one individual.

Yet, when two species diverge, there does have to be a point where they can interbreed, and then a successive point where they can't, and a specific cause of the change.

One thing seems certain: they must stop interbreeding long before it is impossible to do so. Otherwise, through the interbreeding itself, they would remain compatible.

So there must be an initial social split, one group either physically or socially separating from the other, which then allows a physical divergence to take place.

That's the key. It really is a divergence of entire populations, not of individuals.

All the individuals within each population remain compatible with each other, but the social or physical separation of the populations allows them to diverge.

An example of how this can be misunderstood:

Upright walking is often cited as something that caused humans to diverge from other apes, especially because it brought us out of the trees, while they remained there.

But this is exactly backwards. Upright walking didn't drive humans out of the trees, nor even allow them to leave.

They had to leave the trees first. It was only after the behavioral move out of the trees (and coincidentally away from the other apes) that mutations in the direction of upright walking would have been selected for.

Another example was described in Scientific American Frontiers (Spiders) April 1999: 2 types of males in 1 species of spiders, with different courting behavior, both successful, but 1 safer for the male.

A more obvious example is the evolution of American English and its many dialects from the original British.

Separation has to precede the differentiation, else the population would simply remain homogeneous.

Related question: with TV and increased mobility, do children today have less pronounced regional accents, and are regional idioms disappearing, as language fuses back into one? Notable exception: "ebonics", discussed elsewhere as an example of isolation producing divergence.

Once a population is physically scattered, it seems inevitable that they will diverge, just through random variation.

So any event, even a short term one (flood, a river becoming uncrossable, drought, earthquakes, volcanoes), could initiate the divergence, as long as the scattered populations don't reassemble for a long time.

It would seem to be a corollary that the more separated ecosystems there are, the more different species there should be.

----------

Species classification is somewhat arbitrary, based only on the portion of DNA that allows interbreeding.

You could have a huge divergence in the rest of the DNA only, producing radically different members of the same species (dogs), or a divergence in only the reproductive sections, producing individuals of different species but identical in every other respect.

A number of mutations are said to produce sterility as a side effect, but are the individuals only deemed sterile because they can't produce offspring with the majority of other existing individuals, or because they actually lack something essential to reproduction?

If the former, and if they could theoretically produce offspring if a compatible individual existed, then they're not sterile; they're a new species.

 

Misc.

  • Much DNA is common to all life: % in common with humans: yeast 20-30%, chicken 80%, cow 90%, chimpanzee 99%.
  • Human DNA can repair faulty or damaged yeast DNA.
  • Human DNA has about 109 base pairs in 105 genes. (I think one gene is the complete code for one protein.)
  • Humans do not have the most chromosomes. Sheep have more. In total amount of DNA, do they have more? What else has more than we do? In a sense, anything that can create more proteins than humans is "more evolved".
  • A recessive trait could be one that once was beneficial or sometimes is. White vs dark moths in England -- dark moths couldn't have taken over if dark wasn't possible. So it's like a capability held in reserve. It is also something like the default vs specialized rule hierarchy seen in classifier systems. ("Usually produce a white moth, but sometimes produce a dark one, just in case.")
  • Apparently, as individuals age, their DNA breaks into shorter strings. If a break occurs in the middle of the coding for a protein, does it mean that the cell and all its descendants can no longer produce that protein properly? If so, finding a way to recombine the broken strands (at the proper places!) would probably be a key in anti-aging research. (It is, and is a key concept of gene therapy.) (Raised by the question of whether Dolly the cloned sheep would ultimately die of old age while still young because her source DNA was old.) (Also, from the BioMedia TV program on DNA, there is a cell mechanism for repairing DNA in which mistakes occur as it's being duplicated. Example: If an "A" accidentally pairs with a "C", there is an enzyme that can pull it apart and correct the error.
  • The "BioMedia" TV programs I refer to are an excellent and understandable telecourse series that I saw on PBS, called Unseen Life On Earth: An Introduction To Microbiology, hosted by Jim Leinfelder. Here is one link to a site about it, called Microbe World® Beta. (Since it's "beta", I guess that means it's still evolving?) A web search on the show title will discover that the series appears to be available commercially, as well as in some libraries.)

 

Genetic Algorithms

  • A program that somehow facilitated the coding of problems into a format that made it possible to work on them using genetic algorithms would be a good commercial application.

Backtracking (from AI) is a method of organizing the functional subunits that will form a solution. Genetic algorithms seem to be a competing method. Each is an intelligence strategy, in competition with other strategies.

  1. Backtracking: assembles chain of subunits one at a time, selecting each for its probable fitness, based on some test. Works best when the problem is well defined so the test is known to work well. That is, when the destination is a testable goal. It is poor when the space of possible solutions is huge. It also only works where a partial solution can have some testable fitness, so you know whether to keep going or backtrack.
     
  2. Genetic: assembles subunits randomly into prototype solutions, and pits them against the problem. Best prototypes swap subunit-subassemblies in search of increasingly better solutions. Useful for "stretching feelers" into poorly-defined problems where it's difficult to anticipate what shape any solution might take, and where the space of possible solutions is huge.

Basic Computing Instructions: Opcodes

When I was developing "functional subunits" for the agents in Adapt.cpp, I realized at one point that the basic instructions that the agents seemed to need were similar to instructions in CPU instruction sets.

On reviewing those sets, it further appeared that most of the "computations" that I could think of that were performed by us or by other living things could be described in terms of these basic instructions.

It was a revelation to me that this set of instructions, which I would have considered an artificial one, might describe fundamental units of computation, regardless of context.

In the opcode list below, the string manipulation instructions and some others are omitted because they seem overly dependent on the computer context.

Other instructions are included because some seemingly "naturally occurring" computations can't be done without them, but not because they themselves seem to be inherent.

Still others are included just because they seem useful.

Most are extremely "elemental". The TI59 (TI-59 Programmable Calculator) opcodes are more self-contained (less dependent on each other), and proved somewhat more useful as a model for Adapt.cpp.

Math functions:

  • ADD Add
  • SUB Subtract: The basic computation for comparison.
  • MUL Multiply (not basic, and definable in terms of ADD)
  • DIV Divide (not basic, and definable in terms of SUB)
  • ABS Absolute value (magnitude)
  • INC Increment (a variant of ADD)
  • DEC Decrement (a variant of SUB)
  • CMP Compare two values: Is the bear who is chasing me going faster than I am?
  • NEG Negate (unary minus)

Logical functions:

  • AND Logical AND
  • OR
  • NOT
  • XOR (not basic)

Program flow:

  • JUMP Alter flow, possibly depending on a condition.
  • LOOP DO something UNTIL a condition is reached or WHILE a condition exists.
  • Data Manipulation:
  • XCHG Exchange two values (not basic)
  • MOV Move data from one location to another (not basic)

Somewhat related to this are the elemental transformations that might be performed upon a data set, such as insert a step, delete a step, etc. Many of these are implemented in Adapt.cpp. See Typogenetic Code (Godel Escher Bach:508ff) for ideas for others.

Using Opcodes

Opcodes must be designed such that they can be randomly shuffled through the genetic algorithm, and yet still always make enough sense that they don't crash the parent program, even if they are nonsense in the language they're written in. Not a trivial problem; time consuming.

When creating an opcode set, there is a trade-off involving the size of the subunits:

If the subunits are large, a solution can evolve quickly (because there aren't many different combinations to try out), but the program won't be very flexible, because the individual steps in the subunits can't be broken out and rearranged.

If the subunits are very small, the program can be extremely flexible, but a solution might take forever to evolve.

If you have any idea of what might constitute a solution to the problem, you must design the opcodes such that that solution could evolve in a reasonable time.

For example, using 100 different opcodes, a 13 step program is one of 1026 possible. It is absurd to expect it ever to evolve: at a million cells per day, it will take about 274 quadrillion years.

One possible compromise might be to use low level opcodes, but to place them into initial random programs in known-good (or suspected good) groups, as though you were actually using larger, more capable opcodes.

Thus, you get a head start (doesn't take forever to evolve), but the steps are still mutable at a low level.

This might also help with another similar problem, the relative size of mutations. Just as you can only shuffle whole opcodes, you can also only mutate whole opcodes, which makes all mutations large.

If the system is far from a solution, large mutations may help since anything new is likely to be better, and large steps are better still.

But if the system is already close to a solution, they will likely just tear it to pieces since anything new is likely to be bad, and large steps are even worse! (as happened with the "mybest" program).

For fine tuning, small mutations must be possible. Interesting idea is to change mutation size based on a measure of existing success. (What measure?)

ADAPT.DOC (Section of COMPLEX.DOC) - Project notes on the C++ program Adapt.cpp

The purpose of Adapt.cpp (MSDOS) and WAdapt.cpp (Windows) is to provide a laboratory for studying natural selection and evolution not just as processes that supposedly occur in living populations, but as a fundamental natural law that describes the process of change in any pattern, living or not, that exists within any environment; and especially for studying how differences in various factors (elemental subunit size, genetic mating method, mutation rate, gene segmenting scheme) affect the efficiency of the process.

To whatever degree it succeeds, it may simulate biological evolution (in that the cells aren't biological), but with regard to the broader process, it demonstrates it. That is, it isn't a simulator.

Basic questions are:

  1. When patterns exist that can either be destroyed or continue to exist (with or without changes), do some patterns survive more surely than others?
     
  2. Are the surviving patterns shaped by the environment?
     
  3. When the rules of the environment change, do the surviving patterns change in directions that correlate with the rules, reproducibly and predictably?

The experimental answer to all is emphatically Yes.

What's more interesting is how the various factors contribute, either enabling or inhibiting it.

Technical Notes and Comments

An explicit same-species test is meaningless. Don't use it. Even in biology, the test is solely pragmatic: Are offspring produced? Do they live?

In computer, no offspring = objects are so different that no meaningful merger is possible (string DNA vs array of 5 doubles).

And do they live = after creation, do they do anything other than get depleted by runcost? Which is a behavioral test.

There's no way to determine species by examining DNA. (That is, there's no way to tell whether the offspring of 2 cells will be viable until you've created and tested it.)

2 cells with significantly different DNA can have functionally equivalent behavior. Any differing pgm steps that aren't particularly significant are analogous to eye color, etc.

Mutation rate is now based on pgm length (quantity of genetic material available to mutate). Otherwise, shorter pgms have a much higher rate, and longer pgms hardly mutate at all. This should be a rule for all programs using mutation.

This program displays two frustrating characteristics that I thought were weaknesses and flaws in the program design, but that I later heard discussed by Stephen Jay Gould in an interview as being characteristic of natural evolution:

  1. Punctuated equilibrium: there is not a smooth evolutionary progression. Rather, a certain grouping of traits may be very stable for a long time, then completely replaced by another grouping. The replacement often is, but need not be, better by any logical measure. It is possible for pure chance to wipe out a population that looks objectively more robust and had no flaw except to be in the wrong place at the wrong time. (See #2.)
     
  2. Lack of directionality: the evolutionary progression is not necessarily in the direction of greater fitness. It is possible for a less-fit pattern of genes to displace one that's objectively more capable or robust. Gould described this as one of the main misconceptions that people have about evolution: that it somehow was aimed at, and peaks at: a) humans, or b) greater complexity, or c) greatest fitness. Evolution, in fact, is just "change over time"; nothing else is implied.

There must be a net energy increase into the system for the population to grow.

With no "plants" to eat, the easiest method is to give children more energy than they cost.

But excessive growth generates numerous cells that get culled before they can run their programs, which wastes time.

The network of energy inputs and drains seems to be important in any system. If population can't be stabilized here, add "Total System Energy" to reports or output to energy.log for Excel study and graphing.

The energy in/outflow balance here is probably like lambda, determining population behavior (death, growth, & how interesting the behavior is).

If necessary, replace artificial culling and topping up with an automatic adjustment so I don't have to fiddle with the parameters.

  • Overpopulation: increase runcost (because you don't want to stop child production, but do want more room, so create it by making cells die quicker).
     
  • Underpopulation: decrease childcost (because you want more children, not longer lifespans for existing cells).
     
  • Another possibility is to make all of them inheritable and evolve values, selecting for population stability, but this is a lot of trouble, and shouldn't be necessary.

Adapt already selects very well for cell programs that exploit the rules, and changes in the rules produce almost immediate changes in the programs.

In one long run, 2 species developed.

  1. One survived by individuals eating, staying strong, and living essentially forever, but they produced few children.
  2. The others didn't eat much, but their genes survived because they reproduced prolifically.

Other runs have been overly lopsided, producing only eaters or only maters.

  1. The maters were produced when mating didn't weaken the parents, so it didn't matter that they didn't eat. They produced plenty of children before they died. The eaters were produced when a depleted array was automatically filled with clones. They didn't have to reproduce. It was done for them periodically.
     
  2. In another run, cells reached a point where they were so adept at finding and eating each other that after a long period of stability, the population crashed. When dib size was increased (which decreased the population density), population climbed again, until program length increased with additional move instructions (which gave a cell greater range in one run() call, effectively giving them longer legs), and they once again ate the population downward.

Partial rewards are at least as important here as in Classify. If the environment or the rules are so unforgiving that a cell is either totally fit or dead, it's the same all-or-nothing problem. There won't be any evolution. It works best to the extent that less than perfectly fit cells can at least squeak by better than some others, i.e. a partial or graded reward system.

The problem with killing off an old cell to make room for each new one was that since each new cell got one chance to run its program, it potentially allowed a single loop to last forever, with an endless chain of new children. No longer a problem because of energy threshold required for mating and random selection of which cell to run.

MINPOP should be low. Periodic population crashes make accumulating energy useful, for surviving the lean periods, and the lower the population goes, the more severely you select for the best accumulators.

A denser screen makes the "adjacent" opcodes effective.

I think there are now entire programming languages (such as for defining objects used in computer games) dedicated to the types of things this program uses: objects moving around in space, governed by rules of interaction with the environment and other objects. And event-driven. One game even boasts of intelligent adversaries, implying the objects can learn. Check into it.

All heading and other calculations use the correct spatial coordinates. Heading is only translated to match the inverted Y axis in reports to user. Other things to remain aware of: when testing for neighbors, Above is calculated as (y - 1), which is the point above it on the screen. Potential problem? Currently, think not.

There must be no extra cost for moving and turning, and no per-step cost: Since all cells start with the same energy, if a brand new cell attacks another brand new cell, it will always lose and be eaten because of the extra cost of getting there, and the movers, who have the only decent chance (at least in the starting population) die out. Cells that don't eat will now weaken at the same rate, except that cells that execute many useless opcodes have a disadvantage, as they should, because they don't accomplish as much in the allotted time (or stepcount).

Opcodes

Every opcode should leave, as a result code in ax, the most useful information it has about what it just did, formatted (ranged) to be most easily used for program branching control in the case that the opcode is followed by an ax test instruction.

Remember to add sense() at end of any case that might have caused changes to the environment that are detectable by sense().

"if(test) repeat previous step" is not used because there's no way to change the data being tested, and an unconditional "repeat" is just an endless loop.

You can use both "elemental" opcodes and "macro" opcodes that could be defined in terms of the elemental ones. Cells behave more interestingly when the macro opcodes are available.

Organization of Genetic Material

Agent programs are now segmented, modeled after the segmenting of chromosomes into genes, containing embedded delimiters to allow steps of a sequence to tend to cling together and get copied as a unit.

This is necessary if insertions and deletions are allowed, because the same sequence may start at different locations in 2 cells, and step-by-step copying won't reproduce it. If program steps occupy fixed locations, this will happen automatically due to a good sequence's prevalence in the population. But then pgm must start at maximum size from the beginning, which, besides being wasteful, is unrealistic.

My attempts at manually creating "good" cell programs, before segmenting was instituted, were torn apart by mutations. However good the program, having to mate with less effective ones ruins it. This presents an interesting problem: a program can be effective, yet not robust enough to survive. Is there in general less precision and more redundancy in high mutation-producing environments or in populations with much variation?

One changed step can turn a good program into a useless one, so the fact that two programs look similar doesn't tell you anything about their relative fitness. It also means that there may be no steady progression toward improved fitness because there is no reward for a partially good solution. A program could thrash around "near" a good solution for quite a while, with no improvement at all. Then the last essential step falls into place, and it's a terrific program. Do a long program run from scratch (starting with one program step). The nature and progression of programs evolved from scratch may be different from those evolved starting from, say, 10 steps. More redundancy? Hardier sequences?

One of the surprises from the Human Genome Project was the discovery that 50% of human DNA appears to be useless filler, parasitic hitchhiker DNA segments with no function except to preserve themselves in our DNA. They replicate and copy themselves to new locations. This would be consistent with the direction Adapt programs seemed to tend to go: when there's no overseeing organizer, you can wind up with redundancy, filler, obsolete sequences no longer used (i.e. no longer expressed, but that could become active again under different circumstances), all commingled in a hodge-podge mess. This, in turn, is consistent with the discovery that a gene can stop being expressed if a "stop" codon gets inserted into it, blocking its functionality. It's how traits go inactive. Example: human smell is currently 1/1000 as acute as it once was, but the coding for it is still there and could be reactivated.

It is remarkable that a program as crude and unfinished as Adapt already shows parallels to so many natural phenomena like this. It certainly is generally on the right track.

In brief reviews of programs, it looks like unnecessary zeroes (not actually useful in defining segments) do get squeezed out of programs.

Find out: in sexual reproduction, at what level does the mix-and-match occur? Whole chromosomes only? Seems likely, which means you get huge blocks of DNA from each parent, not thousands of little segments cut and pasted together.

Ideas for Changes

Program Operation

it could also have a stack (a stack container of doubles) to push ax onto for memory, or at least one additional test register. Or ax could be a StatArray, not sure what for!

As long as any cell eats any other, overpopulation presents no actual food problem (just the opposite!), so don't try to simulate one (such as with Cull). It's not the real situation.

Agent Behavioral Rules

  • consider rules that might cause cooperation to be advantageous.

    Try to develop rules that lead to predators vs. herding, hiding, fleeing cells.

    Should be able to modify program to create a flock of "boids" using only a few opcodes.

    The boids rules are (from PBS TV show: Math, Practical Purposes): 1) follow the predetermined path specified for the flock as a whole, 2) Get as close as possible to the path(?) and to the center of the flock, 3) Don't collide with each other or other objects.

    Maybe also introduce new environmental factors (that I can change easily), and opcodes that implement inheritable variations in how different cells react to them. e.g. temperature, dark/light, time of day, season, salinity, territoriality, nutrients other than each other, metabolism of them, symbiosis (use of each other for reasons other than eat/mate; e.g. immobiles hitching a ride). Continue making additional program constants and variables accessible from Options dialog. The most interesting feature of the program is how agents respond to changes in the constants and rules, which involves tinkering with the program. (Probably always will; any distribution would have to be in source code form, for experimenting with.)

    Seems like there would be an inherent advantage in a 2-sex system, with 2 separate populations competing within themselves separately from the competition within the species. Base sex on whether cell's id# is even or odd: also easy to tell, when viewing data.

    Preferences for whom to pick as a mate and whom to eat or not seem intuitively important, not just to the survival of the individual, but to the development and survival of a species. It looks like these rules should apply (in Adapt, and maybe more generally). An agent (or species) is better off if each agent:

  • Avoids being eaten.
  • Eats to survive or produces children quickly. (Be long-lived or prolific: the pattern must survive)
  • Avoids eating suitable mates.
  • Produces children of equal or better survivability (especially when there is a cost to producing children):
    • Avoids eating its own children.
    • Chooses a mate. Good criteria:
      • Mate with stronger other.
      • Mate with most similar to avoid corruption of your program. (But maybe not: program corruption is always a possibility, but by mating with stronger, the likelihood is that the "corrupting" influence will be for the better for your children.)
      • Mate with other that already has many children - propensity to propagate.

Once program is fully operational: do successful programs exhibit these characteristics?

Results and Analysis

Chi-squared analysis should be particularly suitable for ending populations, since pgm steps are assigned randomly. At each pgm location; for opcode sequences; for segment sequences.

Run 4 windows (populations) from either identical or random starting populations. Do they evolve in similar directions?

There's an interesting energy economy in the program, with characteristics similar to a monetary economy. Energy tends to concentrate. Those who have it get more, and become unbeatable. Each cell is a little factory that consumes other cells as food and transforms their energy into a product (a cell more like itself than the one it ate!). The program should function well as a closed system, but tends not to. It is more robust if there is a net flow into it (inflation), especially since energy that gets concentrated is effectively removed, unavailable, and must be replaced. Energy flow (analogous to money velocity) is important to its dynamism; without runcost and complementary energy inputs, a population of cells could last indefinitely, but doing nothing, no growth, movement, change, evolution.

[End of ADAPT.DOC]

Up 1-Learning 2-Patterns 3-Real Life 5-Classifiers 6-Biology 7-Neural Nets 8-Connectionism 9-ALife 10-AI

 

Valid HTML 4.01 Transitional Valid CSS
View content labeling at ICRA.
Copyright ©2008 Steven Whitney. Last modified 10/09/2008.