|
25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more... |
Home Projects Up Sitemap Search Blog Forum+Chat About Us Privacy Terms of Use Feedback FAQ Images Services Payments Humor Music |
|
Intro Learning Patterns Real Life Genetics Classifiers Biology Neural Nets Connectionism Life AI Essays on Complex Systems Part 5 - Classifier SystemsClassifier Systems (Section of COMPLEX.DOC)Remember that a classifier system is a particular 1) rule-based learning method that happens to incorporate a 2) genetic algorithm and 3) bucket brigade passback of environmental feedback. Any of these three components can be used independently. Two ideas that I should consider as potential alternatives to the rule-based method are the future-seeking of Tictac.cpp and the DNA-like idea later in this section. Co-Evolution of Message SequencesTo test (or develop) sequences of actions, only allow 1 rule-posted bulletin board message on the board at a time? (You can only do 1 thing at a time.) I.e., a rule can only depend on the thing just done (message just posted previously), NOT "anything within the last 5 messages", the situation now. ----- The thing that interests me most (and thus, the goal of the classifier should be) is to see tiny program subunits (elemental actions) form sequences that in turn serve as building blocks for sequences at the next larger scale. This was one goal of Adapt.cpp, and would be very useful in Talk.cpp. (see next section) (Vital) Role of the Environment"You can't do THIS until you've done THAT first" is an important concept. This may more properly be the responsibility of the environment than of the learning entity. That is, if "X" isn't done, then the environment can't allow "Y", and that becomes how the program learns the sequence. It's also the way the world really works. It's not that "we know" that we can't walk through the door until we've opened it first. It's that the environment doesn't let us do it any other way. In other words, instead of trying to build sophisticated world knowledge into the learner, you need to model the environment realistically. Our learning methods may not necessarily have to be very complicated. We may have simple methods, but are shaped by a complicated environment. Instead of a "program" that I try to make as "smart" as possible, there are now two entities in the program, the learner and the environment, each solely responsible for a different part of the interaction between them. Conceptually, it's a lot easier to think about a modeling a realistic environment than to try to figure out what kind of "world knowledge" the program "should" have. You just ask yourself, "What is the world really like?" If the environment doesn't reward for Y unless X preceded it, then that sequence should develop naturally, because lone Y's will go unrewarded, but Y's that follow X's will be rewarded, and will pay their 10% passback fee to the preceding X, so X and Y have to develop together. Always think of actions (instructions) in terms of goals that can be incrementally rewarded. For example, instead of "grasp knob" (in the thermostat example), the basic instruction should be "decrease distance between hand and knob", and its use in a loop would be: while(there is
distance between hand and knob) "Grasping the knob" is a task that random flailing would be extremely unlikely to accomplish, and it can't be incrementally rewarded. "Decreasing the distance" is something that random flailing can do by accident, and the rewards for accomplishing that subtask would incrementally lead the agent toward the goal of contacting the knob. It's also much more useful because it can be generalized to decrease the distance to anything. When scoring a potential solution, try to provide as much information as possible about why or how the solution is incorrect. The current % grade works well, but later you could also encode such things as "solution too long" or too short, whatever type of hints a teacher might give to a student in addition to a grade. The classifier will learn more quickly if it can make use of this additional information while generating its next solution. Notes and plans from my classifier system C++ program Classify.cppTechnical Notes and CommentsAnteSubtracting ante, strength lost each iteration as a "drain" for bid values, was removed from program a long time ago. Its supposed justification, noted in program, was this: allows obsolete rules to wither: analog of "forgetting". Without this, a rule that attains a high bid value will keep its value forever even if it never gets to post another message. (and if it never posts, it's useless). The deletion of rules whose bid is depleted to zero was also removed. Not sure whether ante actually proved useless. The current use of the Classifiers in time trials, etc., requires that run() be exited as soon as a solution is found, so no rule actually gets very strong; that may be the only thing that made ante unnecessary. Each round, make only bidders pay the ante to prevent useful rules whose stimulus hasn't appeared for a while from withering too quickly? (No, must be all rules). Currently, rules are down to 1 in 32766 iterations, which could easily occur while learning a new training set which doesn't involve them. That's ok. With current concept, it just means the information was forgotten, and must now be relearned. (could exact the ante only once every 100 or 1000 iterations, or rotate problems to keep rules strong). Payments and PassbackWith my % payment method, a rule benefits more from being a "supplier" to a rich rule than to a poor one (appropriate and realistic). A system that used "price supports" for partial correctness and an (equal) bonus for being a sensor-responder got the solution quickly without any successreward! (and then lost it) Always pay (including sensors) option in run() is now turned off: when it is on, a rule that has grown strong, but has become obsolete will pay its 1 / PASSBACK amount (although to no one), and be depleted, but too quickly because it is depleted by ANTE also. Also, in a dead system (with no sensor-responders), if a rule develops that does respond to the sensor, it's immediately depleted by the payment, even though it's the only rule showing any promise. Its lower bid makes it less likely to produce offspring that also respond to the sensor. When interpreted numerically, the presence of dontcare in stimulus can inherently code < and > operators: (it appears the current system can't code for "not equal to" or "greater than") #######0 even (multiple of 2) ######00 divisible by (multiple of) 4 #######1 odd 00000100 = 4 000000## < 4 #####1## >= 4 Each winner pays the rule responsible for having posted its stimulus, but the payoff doesn't go back any farther than that until the next time that sequence is triggered (so in a 3-rule posting sequence with a big reward, the initial poster doesn't get any of that reward until the third time the sequence occurs, though it got its smaller reward each time from the subsequent rule). This is true both in the book (Complexity) and here. I did change it so that now a rule pays its supplier after receiving its successreward, so it goes back one generation immediately. A more important problem should pay a larger successreward, so rules that address that problem will score higher. (For the case where there are multiple problems available to solve in sequence.) It is passback that allows moving (seemingly) farther away from your goal (making current conditions worse) in order to set up an action that reaches it. The setup action must be rewarded. I think I've been tempted at times to just pass back the reward at time of payoff, but you can't do that. If an end-of-chain rule gets a big reward, it would never propagate beyond the rule just prior. You must pass back a % of total strength, at lottery-win time. But you could additionally pass back the reward at time of payoff. That would accelerate the passback process. (I think that is the current procedure in Classify.)
Bid and Bid BalancingAbout bid: negative values won't ever have use because they're disastrous, anyway! the most that ->bid should be is an unsigned. If it is any type of long, the total of all of them could overflow a long or unsigned long. And if it is made unsigned, you'll need to write ulrandom() to calculate and return an unsigned long. Then a number of other variables in the program will need changing or casts. A rule can only be 7/8 or .875 correct without being 100% (7 of the 8 positions correct). This is probably related to the most effective initialbid value being around 85 (when successreward is 110). Any higher, and new untested rules have a higher chance of being chosen than ones whose solution is only 1 char off. Any lower, and new rules that might contain the answer have less chance of being chosen than partially correct rules that are far from the solution. initialbid must be high enough that new rules have a chance against the "price support" set. otherwise there was no point in creating them. And initialbid + successreward should be enough to bump over 100, so a winning rule survives Sweep(). Stimulus and Response StringsThe current use of strings for everything should be sufficient for a long time. If you allow any chars 0-255 in the string, (possibly using them as short ints), very complex environments can be represented and complex rules can be created. Another note suggests examining more closely what the don't care bits code for, and looking for more efficient ways to code the same things without resorting to the binary digits, but it's probably not desirable. At least for now, it's not the numeric significance you're after. It's the pattern matching ability. When designing problem sets, remember problem and solution will no longer have to be the same length, nor do stimulus and response strings. So you can pose problem "11" and look for solution "1". Bulletin BoardSeems inefficient that with each pass, bulletin board only gets one new posting, yet all rules rebid on the whole board again. Consider carefully. Maybe ok, because there may have been new rules added(?). Misc (not yet moved to other categories)The classifier system in general rewards efficiency. If a rule appears that can alone do the job previously done by a sequence of rules, it displaces the sequence. Things to measure: does the system learn faster than an exhaustive search would? Yes. Stim & resp together are 16 binary bits, which would require 65536 iterations to find. A major goal should be to allow plugging in Rules, Bulletinboardentries, and Environments of any structures with the least fiddling. Just as an interesting note, a rule can be triggered by its own previous posting (a self-responder), in which case it will pay itself. Only a toy problem requires the Problem and Environment classes. In a real use, the problem is a real environmental state, and it has no right answer. The feedback (scoring) comes from the user or must be scored by the Classifier based on whether conditions improve. Ideas for ChangesTo match the book (Complexity), extend Rule.stimulus to 16 chars to allow AND
processing. This will require 2 source fields for each Rule, 1 for each triggering part of the
AND. (See Complexity:188) The ability to process one AND automatically allows processing
of AND clauses of unlimited length, up to size of [BulletinBoard - 1], which leaves room for 1
posting actually doing something useful: Before implementing any features involving variable size strings, get the current version in a basically stable and final form. This and some other proposed changes may be wandering so far from the original classifier idea that they should be implemented in completely new programs, or kept merely as ideas for future classifier programs that have a specific use. In spite of examples in the book, I'm still unclear exactly what uses a classifier might be good for, and the idea that this program might ever be generalized sufficiently to serve as a basis for a general classifier module is looking increasingly remote. That is, the component pieces for a useful classifier system for a particular application may have to be so peculiar to the application that the use of any generalized model is impossible. This one is very good basically at string matching, and it seems that you should be able to make it very useful by choosing what is encoded in the strings, but so far I'm short of any practical ideas. Remember that a classifier system is a particular rule-based learning method that happens to incorporate a genetic algorithm and bucket brigade passback. Any of these three components can be used independently. Figure out exactly which data elements must have matching formats, and how to compare and score them. Remember that if variable length strings are allowed, dontcares could creep into response strings, which for now is illegal. New format: the environment status string is a bulletin board posting; it can be any length. The solution can be any length. Rule stimulus strings should start small, but be expandable also to any length. Rule response strings should start small, but be expandable also to any length. When a rule searches for its stimulus string, it begins at position 0 of each message, and compares only to the length of its stimulus string. That is, if its stimulus string is shorter than the posting, it CAN be triggered by the initial segment of the posting (ignoring the rest). But if its stimulus string is LONGER than the posting, a match is impossible. That is, the rule requires more data than is available in the environment. Should the proposed solution (which is also a posting) have to match exactly with the correct solution, in length and in each position?, or can it merely match up to the length of the actual solution? Allow the 2 parents to be different lengths and allow a child to have a different length from the parents. This (what?) may require a different comparison routine when searching bulletin board. All the strings must have mechanisms for lengthening and shortening. Could lengthen by mutating it at random(length+1). If it lands at the position 1 beyond the end, you lengthen it by 1. So tendency to lengthen will decrease as length increases. When any new string is created, any trailing dontcares are stripped, which can result in a gradual shortening. To have a memory of what it did, save scrolled messages to disk? Multi-level classifier? Current system has 1 bulletin board. Could have more than one bulletin board. Each rule would be assigned a board to search and a board to make its postings to. (Not necessary. If the expandable-environment above were implemented, instead of having multiple bulletin boards, you'd still have only one, but many rules would only pay attention to a portion of it.) This would be a lot like a very flexible neural net, except that the synapses, instead of consisting of the weights, would be the bulletin boards, and a neuron would fire, instead of when its threshold was exceeded, when it found its stimulus on the board. Might need to use one of the bits in message postings to indicate the source (the classifier itself [rule-posted] or the environment). This is for the benefit of the Classifier, not the rules, so it can tell self from other. Link classifier systems into a network? One's output is another's input. Like a neural net, except each node is an entire classifier system, and thus hopefully more capable. Currently, a rule's stimulus and response are independent of each other. Would there be any advantage to creating (a type of?) rules whose response is a variant of its stimulus, such as having all the bits inverted? The intent is to solve the old "inversion" problem (from neural.cpp): to invert an input, there has to be a rule that specifically outputs the inverted input; so to invert ALL inputs, there has to be such a rule for every possible input. If a rule could just output translations of its input, it could address many more situations. However, at the time of making this note, I can't remember if this idea stays within the "spirit" of the classifier system itself, so consider it with caution. Remember the idea of allowing wildcards in the response string also: the wildcard slots would be filled by the corresponding chars in the message that ->stimulus matched. Ex: if "11111110" was posted, it would match "#######0"; if that caused the response "#######1" to be posted, it would actually be posted as "11111111", taking the first 7 chars from the original posted message. This would allow rules such as (from the previous example): "if the original message is even, add 1". I thought this would be useful, but can't remember why. Seems like it would add a lot of flexibility. If you had a type of rule that could modify an existing bulletin board message, then if the response string had don't cares in it, it could perhaps lay a pattern into the message. That is, it would leave alone any message bits that corresponded to its own don't cares, but would change all the rest.
Rules that merely modified existing messages might be a very useful addition, but for a very advanced version of the program. First make sure that this can't already be accomplished by the current scheme. I don't think it can in any flexible way. That is, you'd need a separate rule for every possibility, which misses the point. ---------- I'm also intrigued by the idea of enzymes that snake their way along DNA, replicating it. Could there be types of rules that do the same sort of thing? It would go along the environment (problem) string looking for a pattern that signals it to start processing there, then continue along the string, perhaps modifying, in place, what it finds there (similar to the above paragraph) or replicating a portion of it to post as a new shorter message. That is, if it finds 00010 in the long message it is searching, it simply posts the 00010 as its own message. This would result in effectively translating the 00010 to a new location where all its bits would have new meanings. This sounds very powerful, but it's so abstract, I can't think of why, or of examples of what it would be useful for. (A disadvantage of in-place editing is that the original string gets destroyed and is lost.) Currently, bulletin board postings are just dead notes. These ideas would give them a more active role in the system as information holders and transmitters. In fact, this might make them the information holders. The rules just become enzymes that create, modify, duplicate, and carry out the instructions coded in the messages. This scenario would finally allow in-place arithmetic of variables (any kind of manipulation you like) in something like a stock market analysis system (where you don't know what ratios or formulas you want to consider, and where in fact the whole point is to discover which ones you should consider, without having to invent, generate, and test each one yourself), or in the calculations that Adapt cells do, but need an entire "functional subunit" language to accomplish and in which each variable or manipulation of it has to be explicitly given an opcode. In this system, the functional subunits are the capabilities of the rules, the enzymes. One rule might be able to add two adjacent numbers together, or multiply, or whatever. Or find a marker, read the number there, find another marker, and multiply, in place, what's there, by the first number. This also fits in with the idea somewhere else of using one contiguous area of memory for storing everything, and interpreting the bytes found there in arbitrary ways, except now you could have marker bytes used as separators or identifiers. ("Attention: the next byte is the beginning of a number originally stored as an integer!" ) You could allow the markers to migrate, causing the bytes to be interpreted differently. Write a new program to develop the basic idea. Read carefully Godel Escher Bach:508ff before starting. Big data area, and enzyme-rules that search through it and modify it as they go. This is something like a cellular automaton, except it's not cellular. Instead of positional rules, the rules are these active agents that roam around. And if the agents don't always become active in the same order, you have a robust unpredictable system, not automaton-like (similar to fractal generation, where you must choose the transforms randomly or you only get a few dots). In effect, the enzyme-snaking system can be described as a classifier system. The blocking enzyme is the "if" clause, the stretch of DNA that it blocks is the "then" clause. Together, they form a rule: "IF the chemical that strips me out of my blocking position is present, THEN allow this protein to be made." The bulletin board is the surrounding soup of chemicals, which are the messages. At any given time, various messages are either stripping off the blocking enzymes, or not, allowing, or not allowing, the various THEN clauses to be executed, allowing, or not allowing, the various proteins to be made. Each protein is a new posted message, which could in turn trigger additional rules to activate. May need these 2 functions:
Try it on the "thermostat" problem (see Adapt.cpp ). Similar to pipeline problem. Try it on a simulated maze (see page 190). Try it on "word matching": output the same word it's given (variable-length). Try taking out all rewards and penalties & just watch what happens? Something like Echo: there isn't any "solution", just that the survivors survive. Routine for translating the bulletin board into a single output? One approach could "OR" each message with the output string, so the output string would be an "OR" of all the messages. Once (if) there's a translating routine, ... two approaches to using the bulletin board:
The system planned for Talk.cpp may serve as a guide: If the output sentences are composed of consecutive bb postings, then the order of posting is significant, which means that each token in the word is actually a solution: i.e. a "word" wins its place as next in line, and becomes the "sensory input" (possible stimulus) for the next round of bidding. In this case, the bulletin board need only have 1 slot for rule-postings (in addition to sensor postings). But since it is the resulting SENTENCE that is judged for coherence, you can't assign rewards until the sentence is finished: which is after quite a few "solutions" have been posted. The system has to have a memory that is longer than "one solution long", so it can pass the rewards, if any, back. Is this right? Is it already provided-for in the current system? Also consider: what are the implications of the FIRST plan for the system, which is to have all rules post messages until board is full, then post that as the answer. That would be a lot easier to handle. In current version, the Environment is really the teacher, and could be renamed as such. run() is really train(), since the goal is to produce a set of useful rules. It's only after training that the set can be put to use. Still confused how this training version relates to a future version that changes environment but doesn't really get scores from it. You could train it first, then put it to use, like a neural net, but it should be able to continue learning after its initial training, meaning there really shouldn't be any difference between the two. Two modes? If it doesn't get rewarded, or doesn't get what it wants or expects, go into training mode? During training, it gets rewarded for correct responses. Should it develop a memory of the expected reward that a response should receive? This is edging toward the Tictac model. It's also realistic. You do develop an idea of what the reward should be in a given situation. If the response doesn't bring it, you can figure that something has changed, and your learning has to be revised. Tic tac toe version: Rules will have to consist of possible board positions as stimuli and digits 0-8 as responses. tictactoe involves consecutive moves. Each move will be the result of a bidding auction. You need a reward for legal moves, so that they are selected for (to teach it the game). Or, if a rule codes for an illegal move, you can just kill it immediately. Given the above setup, the only illegal move it can make is to try to play its piece into an already occupied square. But then at the end of the game, you need to reward the final move and the moves that led up to it. That is, being a legal move is good, but being a strategically good legal move is better. And you must reward all the moves that were made during the game that led to the final move. Since the bulletin board only contains postings competing for the final move, you need an array (a memory) of consecutive winning rules (the winners, which resulted in moves in the game). This is probably a good addition anyway. Then for a win, you can either reward all the game moves equally, or use passback through the array elements. This concept is very far removed from the current version, and should be a separate program, in the Tictac subdirectory. ---------- Pipeline program description implied some features I haven't used here. Read it more carefully. Also overall Classifier description. E.g. book says, "rewarding the classifiers active at the moment of payoff". I only reward one. If the chars in the strings are treated as consecutive pixels, the system might be able to do basic visual processing.
Comparison of genetic algorithms, crossover vs position-by-positionmate() only produces 1 child. In the book (Complexity), 2 crossover children are produced. There, the first child gets its starting sequence from 1 parent and its trailing sequence from the second. For the second child it's reversed. The reason I chose only 1 is that so many children are produced. While it is unlikely that mate() will go on to produce exactly the 2nd child that the book would have generated, there is no reason to believe that the book's twin would necessarily have been better than the next child produced here instead (with the next mate() call). It is likely that at least the strong parent will eventually pass both its starting and trailing sequences to a child. The book uses crossover genetic method after concatenating stimulus and response. This (coincidentally or on purpose?) results in a child receiving either stimulus or response wholly copied from one parent, with the other string being spliced from the two parents. My version copies each position in each string from one parent or the other, which lets stimulus and response evolve separately from each other, and allows the (remote) possibility of a child being identical to one parent. In general, what you're trying to do is find clusters of values that work well together, and this position-by-position choosing would seem to thwart that by effectively creating too many cutpoints, making it highly likely that such clusters would be separated from each other. The reason this should be ok, however, is that any such clusters that do work well should spread through the population, and thus any chosen parents are likely to have that cluster at those positions. In those cases, it doesn't matter which parent you choose the gene from, since both parents are identical in those locations. There is no observed learning speed difference between the book's method and mine, and I like mine better. If this isn't already noted in Adapt, make sure it's moved there, if it is ever deleted here. ---------- Both the Rule and Classifier classes now have members and an overall organization that can serve as a guide in creating any class that you want to be evolvable. In evolve(), too-easy problem sets cause ties, which seems to allow disadvantageous mutations to creep into the set (and sometimes take over), degrading times for subsequent sets. See what happens over a long run. Could delete a Classifier only after it has clearly lost (tie breaker?), or delete 1 at a time, as they lose, refilling only when set is down to 2, but either introduces new complications. ConceptualizationsThe basic classifier system seems to be sort of a basic "engine" or "processor", with the i/o being peripherals: you can assign any meaning to the inputs and outputs, (that is, you can independently "translate" them into whatever you want), but it has to be done carefully because the whole system has to make sense. This also raises the issue of whether the bulletin board postings are regarded as actions that are sequentially performed, or whether they are themselves only inputs to the "translation" routine, which somehow combines them into a single output action. This is a critical question for any potential application, and has to be answered before even this system can be expected to do anything more than the string matching it does now. The classifier's stimulus string is an efficient notational system for a sequence of if/then clauses, a filter: Looking at one position at a time: "if position 1 has this value, and position 2 has that value, and ... I don't care about position 3 ... and position 4 has this value, then..." This is a bunch of Ifs connected with ANDS. An AND is indicated when positions have a value. You can transform something to an AND by adding it to the filter, giving it a position in the stimulus string. (Meaning that the stimulus string (the stimulus provided by the environment) could indeed be infinitely expandable, to accommodate facts about the environment that are newly learned to be significant. After you add something to the end of the available environment, you'll have a bunch of rules that don't know about, don't concern themselves with, and can't access the tacked on portion, but they remain valid for the portion they DO address. (Strand() should still work.) And you can create new versions of these old rules by using a genetic algorithm that allows the 2 parents to be different lengths and that allows the child to have a different length from the parents. Thus the new rule could have the useful characteristics relative to the old shorter stimulus string, with the additional refinement that uses (filters) new portions of the environment that the parents couldn't deal with.) And you can transform something into an OR by making it a don't care (or removing it, but you must preserve the blank in the position it held, so the other bits keep their absolute positions. You can, however, strip all consecutive trailing don't cares from every stimulus string.) You could also periodically cull the entire database, and strip out the positions where every rule has a "don't care" in that position. The current implementation doesn't seem to allow rules to effectively react to multiple separate stimulus strings, connected by AND (E.g. only if both 1000### AND 100##000 are posted on the board, then post...). I can't think of any sequence or combination of single-stimulus responders that can reproduce that concept. However, the above idea of having very long stimulus strings takes care of it. (I.e. if you need an AND, you just tack it onto the existing environment string). Having 2 stimulus strings does allow AND to any level of complexity. That is, if 2 rules post, and each responds only when its 2 stimuli are both on the board, then you know that all 4 required postings are already there, so it can condense multiple postings into one result. This is very useful, but it complicates the reward scheme, and is merely the introduction of a "meta-level" on top of the basic system. That is, all this already occurs within (is already true of) the stimulus-response strings (they AND the individual positions within the strings, and produce a single result, the posting). This just introduces it at a new level where the units are entire strings instead of positions within a string. Having long stimulus strings instead is probably a preferable method, except that it introduces position dependence. (It doesn't just matter whether your 2 required stimuli are there, but that they be in proper order.) On the other hand, 2 rules can develop identical except for the order, which removes the position dependence problem. As possibly noted elsewhere, perhaps the bulletin board could be one long string in which individual postings are flagged with an initial "message start" marker. Messages scroll off left end, and searching for stimuli only requires string.contains(). Remember that whenever you apply a filter, you get two parts: what gets trapped by the filter and what falls through it. This is important because in a classifier the rules are really traps. It is what gets trapped by the rules that does the useful work. Oppositely, in a neural net, the trapped signals are presumably the ones that get blocked, killed off, while the ones that get through go on to do useful work. So in this way the two systems are opposite kinds of filters. Scanning the input, the classifier actively selects the useful information, while the neural net lets only the useful information into the system, by blocking what it doesn't want. Currently, there are actually 2 independent evolutions going on: the system selects for stimulus-responding rules. And it selects for rules that more nearly approximate the solution. Also see notes for Tictac.cpp about classifiers. In any reward passback system, it is much better if you can identify what really is responsible and should get the credit or blame. Doing it with a rigid passback scheme should be only a last resort if no heuristic method can be thought up. In some systems, a much different outcome can be attributed to whatever was done differently this time from other times where the other circumstances were the same. That is, attribute the different outcome to whatever was different about the circumstances or the action. [End of CLASSIFY.DOC] Goal Seeking Systems[Start of TICTAC.DOC.] Notes for the Tic Tac Toe C++ program Tictac.cpp that learns from experienceTechnical Notes and CommentsIdeas for ChangesOnce there is a library of "good states" stored, cutting them up and combining them in new ways (possibly using a genetic type method) could be a crude form of imagination, a way for the program to invent new possible good states that it has actually never experienced. Provides a way to generate new ideas (and goals) that are reality-based, rather than just generating random ones that have a high probability of being nonsense. There are 3 ways this program could be implemented:
Later, there should be "restrict/generalize" routines. These would scan the high-scoring board states in memory, determine the similarities, and thus abstract what about the positions was relevant to their high score. (This would hope to abstract series of 3 X's as the real goal.) Then the other positions in the state would be converted to "don't cares", and the resulting state should get a score that is the sum of all the states that went into creating it. (After you do this, if in a later game you reencounter a board that was consolidated, that board will again be added to master. To avoid this, should you do this consolidation at the time you add boards to master? That is, don't just check whether that board is already there, but check whether it can be consolidated into a board that is already there. How? Could we let it assume that if it won, it was because of something it did (i.e. not luck)? In that case, you could search for patterns of its own piece, ignoring the other side.) This would cut down on the number of specific states it would have to store in memory. People certainly do this. You don't store complete details of every situation. Over time you figure out, for any particular situation, which details you must pay attention to when making a judgment about the situation. If a variable you previously didn't notice turns out to be important, you add it to the list, thus refining the filter with which you view that situation. Note that most filters actually contain a HUGE amount of don't-care variables, which is why you must refine in this manner. ConceptualizationsThe Player records and scores all endgame positions (scored relative to 'x'). So its database contains, and it can refer to, results of actions that weren't its own. This is a crude form of learning by example. Before deciding on a move, it theoretically has all moves available to make. Then a future is picked, which screens out some of the moves. This is the filter. (When using the int[9] proposed above,) successive futures screen out more possibilities until only a few are left. Now consider this as a physical filter with physical screens. There are pathways through the screens. Some dead-end before they get very far, others go farther, and some (the remaining moves) go all the way through. Intuitively, this seems like a process similar to what a neural net does. That is, some impulses don't get very far (are inhibited along the way), others make it to the "end" of their circuit. What has me stumped about neural nets is how to accomplish this with backprop. This way of looking at it might help. I don't even like the idea of backprop. What else could take its place? It would be good to be able to pick exactly where a signal goes "wrong", and instead of adjusting a strength, just grow a new connection to that point which because of its chosen origin will, in that circumstance, kill the signal, but that might allow it through in other circumstances. Here there is no need for partial rewards. In a classifier, partial rewards help guide the system towards an unknown endstate by rewarding the steps along the way. Here, the endstate is envisioned from the start. The difference between my system and a classifier is that a classifier is response oriented, while mine "envisions" a future based on desirable states it has encountered, and tries to achieve it. A classifier rule says, "In this situation, that is a good thing to do." Mine says, "Given this situation, that good (based on past experience) situation is achievable, if I do this..." A classifier assigns values to rules that tend to lead to good states. Mine assigns values to states, and figures out how to get there. Goal setting and goal seeking. That may not be correct. This system may just be a disguised classifier. Given an input bulletin board posting of the current state of XOX---XOX, along with a posting that "you are O", it will play 5 because that associated future has accumulated a high score. This is the same as a classifier rule something like "O: XOX---XOX -> 5". One difference is that this system implicitly can look ahead several steps (it's looking toward the endgame board). Also, a classifier would have to code every possible stimulus, meaning every possible legal interim board, numbering in the thousands. But with its "don't care" bits, it can probably do this efficiently. It is peculiar here that the computer player doesn't actually know the rules of the game, as a human does, nor does it know what causes winning (3 in a row). This makes it poorer for tictactoe, but is more lifelike and more applicable to more complex situations. The goal seeking feature of this system seems like an advantage over a classifier, but it has a major drawback in saving only endstates. In many situations, especially where the range of possibilities is large, there aren't really any endstates to save. Every state is an intermediate one. What this system really does is save any state where it receives a reward or punishment, which in this case is an endstate. It doesn't have to be, since it wasn't in the first version, but the number of states was too large. The best compromise would be a system where it sought a previously-rewarded state if possible, but resorted to a classifier system if it couldn't find something to seek or avoid. [End of TICTAC.DOC.] |
|
|
|
|
|
|
Copyright ©2010 Steven Whitney. Last modified Thu 10/21/2010 02:08:01 -0700. |
||