Design Notes

 

Notice

Work In Progress
This is a work in progress and is neither complete nor up to date. We place it here in the hope that, even incomplete, it will be of some help.

Introduction

Let X be a time series of vectors, each a numeric vector such as x = #(num| xtime x1 x2 x3 ... xm), and let Y be a numeric vector of "score" values. Let both X and Y be of length N.

The time series X, together with the "scores" Y, is used to train this learning machine. There is also a testing set, TX and TY, of vectors similar to those in X and Y but drawn from the testing time period (a time period not present in X or Y). After training, the machine is presented with the testing set TX and outputs an estimator Lambda, the Estimator, which attempts to estimate ty in TY when presented with the corresponding tx in TX.

Running this Estimator on every tx in TX returns a vector EY of estimates for TY. Each element of EY contains a numeric estimate of the corresponding value in TY. The Estimator Lambda attempts to:

  1. Make the natural ordering of EY be predictive of the natural ordering of TY
  2. Minimize the least squared error between EY and TY

The internal model used by the Estimator Lambda to compute each ey is important. The range of X and Y training sets that can be handled, and the power of the resulting estimates, are determined by the internal model used in the Estimator Lambda.

Linear Regression

Using the technique of linear regression, the Estimator Lambda can provide accurate estimates for all X and Y training sets constructed from the model y = c0 + c1*F1(x), where c0 and c1 are internal numeric constants and F1 is an arbitrary numeric function on members of X. If the choice of F1 is known, linear regression can estimate the values of c0 and c1 quite accurately just by training on X and Y. If the choice of F1 is unknown, linear regression works very poorly.
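
For readers who want something concrete, here is a minimal Python sketch (not part of ARC) showing how c0 and c1 can be recovered by ordinary least squares when F1 is known. The choice F1 = sin and the sample data are assumptions made purely for illustration.

    # Minimal sketch: ordinary least squares fit of y = c0 + c1*F1(x)
    # when the basis function F1 is known in advance.
    import numpy as np

    def F1(x):
        # Assumed basis function, for illustration only.
        return np.sin(x)

    x = np.linspace(0.1, 3.0, 50)                  # training inputs
    y = 2.0 + 5.0 * F1(x)                          # scores generated by the model
    A = np.column_stack([np.ones_like(x), F1(x)])  # design matrix [1, F1(x)]
    (c0, c1), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(c0, c1)                                  # recovers approximately 2.0 and 5.0

If F1 is replaced by the wrong function, the same fit produces large residuals, which is the sense in which linear regression "works very poorly" when F1 is unknown.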

Multiple Linear Regression

Using the technique of multiple linear regression, the Estimator Lambda can provide accurate estimates for all X and Y training sets constructed from the model y = c0 + c1*F1(x) + c2*F2(x) + ... + cm*Fm(x), where c0 through cm are internal numeric constants and F1 through Fm are arbitrary numeric functions on members of X. If the choices of F1 through Fm are known, multiple linear regression can estimate the values of c0 through cm quite accurately just by training on X and Y. If the choices of F1 through Fm are unknown, multiple linear regression works very poorly.
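
A similar sketch for the multiple case, again assuming the basis functions F1 through Fm are known; the specific functions and data below are invented for illustration.

    # Minimal sketch: multiple linear regression with known basis
    # functions F1..Fm, fit by ordinary least squares.
    import numpy as np

    basis = [np.sin, np.log, np.sqrt]              # assumed F1, F2, F3
    x = np.linspace(0.1, 3.0, 50)
    y = 1.0 + 2.0*np.sin(x) + 3.0*np.log(x) + 4.0*np.sqrt(x)
    A = np.column_stack([np.ones_like(x)] + [f(x) for f in basis])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coeffs)                                  # approximately [1.0, 2.0, 3.0, 4.0]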

Abstract Regression-Classification

Abstract Regression-Classification uses the techniques of linear regression, multiple linear regression, support vector regression, and percentile grid regression. The Estimator Lambda can provide accurate estimates for all X and Y training sets constructed from the models y = c0 + c1*E1(x) or y = c0 + c1*E1(x) + c2*E2(x) + ... + cm*Em(x), where c0 through cm are internal numeric constants and E1 through Em are valid Estimator language regression expressions. Even when the choices of E1 through Em are unknown, Abstract Regression-Classification works very well, performing natural order preservation as its first priority and minimizing the least squared error as its second priority. An explanation of the power and limitations of valid Estimator regression expressions is contained in the next section.

Estimator Language

The Abstract Regression-Classification Machine evolves a population of well-formed-formulas (WFFs) in a regression-oriented grammar known as Estimator. Each WFF is known as an Estimator and is trained to map number vectors in X and TX to numbers in Y and TY. The Estimator language is a dialect of JavaScript, chosen to make it easy for the current generation of programmers to learn and read. Each Estimator WFF must conform to one of the regression model syntaxes described below.

The Estimator grammar is defined, within the ARC Lambda, in arc:estimator:%DECLARATION, which is a feature-based grammar specification understood by the parseLib. The arc.estimator child Lambda is an Estimator parser, generated from arc:estimator:%DECLARATION by parseLib. The Estimator parser translates ASCII text strings into Estimator WFFs, which are annotated s-expressions. Each WFF s-expression in the population may be annotated with grammar notes taken from the rules defined in arc:estimator:%DECLARATION.

Each Estimator WFF must choose one of the following regression model syntaxes to remain grammatically correct.

Linear Regression

         
    regress expression;

for example:

    regress (x10 / sin(x12));
             

Trains a linear regression Estimator WFF for the model expression. The expression must be a valid Estimator regression expression.

Support Vector Regression

         
    svmregress(expression1, expression2, ..., expressionM);

for example:

    svmregress (x10, cos(x12)/log(x3));
             

Trains a support vector regression Estimator WFF, with a cubic kernel, for the models expression1 through expressionM. The models expression1 through expressionM must each be valid Estimator regression expressions. There must be AT LEAST one model expression and there may be up to M model expressions (where M is the number of elements in each x vector). The special case of svmregress(); is assumed to mean svmregress(xtime,x1,x2,...,xm);
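
ARC's support vector machinery is internal to the Lambda; as a rough outside analogue (an assumption, not the ARC implementation), scikit-learn's SVR with a degree-3 polynomial kernel can be trained on the same kind of user-chosen model expressions:

    # Rough analogue of svmregress (not the ARC internals): support vector
    # regression with a cubic (degree-3 polynomial) kernel over two model
    # expressions, x10 and cos(x12)/log(x3). Data is synthetic.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_raw = rng.uniform(1.0, 10.0, size=(200, 13))   # hypothetical columns x1..x13
    features = np.column_stack([
        X_raw[:, 9],                                 # x10
        np.cos(X_raw[:, 11]) / np.log(X_raw[:, 2]),  # cos(x12)/log(x3)
    ])
    y = 0.5 * features[:, 0] + 2.0 * features[:, 1]  # synthetic scores
    model = SVR(kernel="poly", degree=3).fit(features, y)
    estimates = model.predict(features)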

Percentile Grid Regression

         
    pgmregress(expression1, expression2, ..., expressionM);

for example:

    pgmregress (x1, exp(x12)/log(x3));
             

Trains a percentile grid regression Estimator WFF for the models expression1 through expressionM. The models expression1 through expressionM must each be valid Estimator regression expressions. There must be AT LEAST one model expression and there may be up to M model expressions (where M is the number of elements in each x vector).
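
Percentile grid regression is ARC-specific and is not specified in detail here. Purely as a hypothetical reading, the sketch below buckets each model expression into percentile cells and predicts the mean training score of whichever cell a new point lands in; the real pgmregress may work quite differently.

    # Hypothetical sketch only: one plausible reading of "percentile grid
    # regression". Each model expression is bucketed into percentile cells,
    # and a cell predicts the mean training score of the points inside it.
    import numpy as np
    from collections import defaultdict

    def cell_index(column, value, n_cells=10):
        # Which percentile bucket of `column` does `value` fall into?
        edges = np.percentile(column, np.linspace(0, 100, n_cells + 1))
        return int(np.searchsorted(edges[1:-1], value, side="right"))

    def fit_grid(features, y, n_cells=10):
        cells = defaultdict(list)
        for row, score in zip(features, y):
            key = tuple(cell_index(features[:, j], v, n_cells)
                        for j, v in enumerate(row))
            cells[key].append(score)
        # Mean score per occupied cell, plus a global fallback.
        return ({k: float(np.mean(v)) for k, v in cells.items()},
                float(np.mean(y)))

    def predict_grid(features, means, fallback, row, n_cells=10):
        key = tuple(cell_index(features[:, j], v, n_cells)
                    for j, v in enumerate(row))
        return means.get(key, fallback)   # fall back to the global mean score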

Estimator Expressions

Valid Estimator regression expression well-formed-formulas (WFFs) can be constructed by recursively applying the following production grammar rules.

    TERMINAL:  xtime x1 x2 ... xm

    explanation:  A TERMINAL is any one of the elements of the input vector, x.

    example:  svmregress (xtime, x3, x10);
             
    UNARY:  abs(EXP) cos(EXP) cube(EXP) exp(EXP) integer(EXP) log(EXP) (- EXP) sign(EXP) sin(EXP) sqrt(EXP) square(EXP) tan(EXP) tanh(EXP)

    explanation:  A UNARY is any one of the unary operators, shown above, enclosing a valid Estimator regression expression WFF.

    example:  svmregress (sin(xtime), abs(x3), log(sqrt(x10)));
             
             
    BINARY:  avg(EXP1,EXP2) (EXP1+EXP2) (EXP1-EXP2) (EXP1/EXP2) (EXP1*EXP2) expt(EXP1,EXP2) max(EXP1,EXP2) min(EXP1,EXP2) mod(EXP1,EXP2)

    explanation:  A BINARY is any one of the binary operators, shown above, enclosing two valid Estimator regression expression WFFs.

    example:  svmregress (sin(xtime)-abs(x3), log(sqrt(x10))*x2);
             
             
    EXP:  TERMINAL UNARY BINARY

    explanation:  An EXP is any WFF resulting from invoking the TERMINAL, UNARY, or BINARY Estimator grammar rules up to five times recursively.

    example:  pgmregress (sin(xtime)*abs(x3), tan(sqrt(x10))*x2);
             

A more detailed description of the Estimator language is provided in the chapters on Estimator Introduction and Estimator Statements.
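
To make the recursive production rules concrete, here is a hypothetical Python sketch that randomly grows Estimator-style regression expressions by applying the TERMINAL, UNARY, and BINARY rules to the stated maximum depth of five. It mirrors the grammar above but is not the parseLib-generated parser.

    # Hypothetical sketch: randomly build an expression string by applying
    # the TERMINAL, UNARY, and BINARY rules to a maximum recursive depth of 5.
    import random

    TERMINALS = ["xtime"] + [f"x{i}" for i in range(1, 13)]
    UNARY = ["abs", "cos", "cube", "exp", "integer", "log",
             "sign", "sin", "sqrt", "square", "tan", "tanh"]
    BINARY = ["avg", "+", "-", "/", "*", "expt", "max", "min", "mod"]

    def grow(depth=5):
        if depth == 0 or random.random() < 0.3:
            return random.choice(TERMINALS)                     # TERMINAL rule
        if random.random() < 0.5:
            return f"{random.choice(UNARY)}({grow(depth - 1)})" # UNARY rule
        op = random.choice(BINARY)
        a, b = grow(depth - 1), grow(depth - 1)
        if op in ("+", "-", "/", "*"):
            return f"({a}{op}{b})"                              # infix BINARY rule
        return f"{op}({a},{b})"                                 # prefix BINARY rule

    print(f"svmregress ({grow()}, {grow()});")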

Evolutionary Computation

There are many specialties under the umbrella field of Evolutionary Computation, including (but not limited to): Genetic Algorithms, Genetic Programming, Artificial Life, Grammatical Evolution, Evolvable Hardware, etc. However different, these various specialties share a least common denominator of similarities; and each of these specialties possesses at least some techniques which have proven useful when applied to selected problem domains.

Populations

One of the primary least common denominator areas across all specialties in the umbrella field of Evolutionary Computation is the presence of large populations of things which change over time, becoming a population of fitter things. In the specialties of Genetic Algorithms, Genetic Programming, and Grammatical Evolution, the things are called genomes. In other specialties the things are called Lambdas, organisms, robots, rules, DNA sequences, etc.

Fitness

Another important least common denominator area across all specialties in the umbrella field of Evolutionary Computation is the commonality of fitness. The population is said to become fitter over time. In each specialty, the concept of fitness is expressed via a mapping from things in the population to a fitness score. In our observations of each area of specialty, the concept of fitness mapping can be generalized to a mapping from elements of a population of things to a real vector space.

Survival

Survival is another least common denominator area across all specialties and is related to the fitness measure in some manner. In all specialties there are discrete time steps during which the things in the population are promoted (survive) into the next time frame. Normally, this survival mechanism is related to the fitness measure in that the population becomes fitter over time.

Operators

In almost all areas of specialty which we studied, the concept of population operators morphing the population is a least common denominator. In the specialties of Genetic Algorithms, Genetic Programming, and Grammatical Evolution, the operators are called mutation, recombination, reproduction, etc. In many cases the population operators incorporate the concept of survival. In these architectures, the population operators map things (in the current population) into things (in the population one time step in the future). Things which are not promoted do not survive.

Data

All areas of specialty which we studied supported the concept of training data in one form or another. In many cases the training data can be numeric vectors. In other cases the training data is the initial environment for an ant colony, or a geographic playing field for robot traversal, etc. In all these architectures, the training data, population operators, and fitness mapping work together to produce populations of things which better solve a training problem. In many architectures, there is a concept of problem generalization. In these architectures, once trained, things in the population can be presented with testing data (not seen during the training period) and better solve a testing problem (presumably because the previous training experience was generalized).

Evaluation

All areas of specialty which we studied supported the concept of evaluation in one form or another. There must be some process by which each thing in the population can be evaluated so as to produce a fitness score. In the specialty of Genetic Programming, the genome is often a Lisp s-expression and the evaluation is the Lisp eval function. In the specialty of Genetic Algorithms, a translator is normally required to convert the genome into a program which can be evaluated. In the specialty of Grammatical Evolution, the genome is run through a grammar production resulting in a program which can be evaluated.

Design Principle

The Abstract Regression-Classification Machine Lambda (ARC) is a symbolic regression engine, designed to accept an X Y training matrix as input and to output an estimator Lambda (Estimator) which attempts to map elements in X onto elements in Y, in an order preserving manner, with a high degree of accuracy. In order to accomplish this task, the Abstract Regression-Classification Machine Lambda (ARC) is designed to take advantage of practical research results in the general field of Evolutionary Computation, regardless of their specialty of origin.

Design Ideas

The Abstract Regression-Classification Machine (ARC) is designed, from its basic foundation, to take advantage of any practical research results in the general field of Evolutionary Computation, regardless of the specialty of origin. The Abstract Regression-Classification Machine architecture is a merger of a wide range of evolutionary techniques and theories from throughout the whole field of Evolutionary Computation.

Population of WFFs

The Abstract Regression-Classification Machine evolves a population of well-formed-formulas (WFFs) in a problem-oriented grammar known as Estimator. The Estimator grammar is defined, within the ARC Lambda, in arc:estimator:%DECLARATION, which is a feature-based grammar specification understood by the parseLib. The arc.estimator child Lambda is an Estimator parser, generated from arc:estimator:%DECLARATION by parseLib. The Estimator parser translates ASCII text strings into Estimator WFFs, which are annotated s-expressions.

By implementing its genomes as annotated s-expressions, the Abstract Regression-Classification Machine takes advantage of the large body of Genetic Programming tools while also facilitating the use of many interesting techniques available in other specialties. Bit string annotations to the WFF s-expressions facilitate the use of many interesting techniques available in Genetic Algorithms. Grammar annotations to the WFF s-expressions facilitate the use of many interesting techniques available in Grammatical Evolution. Goal and plan annotations to the WFF s-expressions facilitate the use of many interesting techniques available in MultiLambda Systems and Swarm Learning. Nevertheless, since each WFF in the population is essentially an s-expression, evaluation requires only a call to the AIS Lisp eval function.

Operators on WFFs

The Abstract Regression-Classification Machine applies a set of grammar preserving operators to evolve the population of well-formed-formulas (WFFs) into a population of fitter WFFs. The ARC population operators are grammar preserving such that each operator maps from a population of Estimator WFFs to a population of Estimator WFFs. Each Estimator WFF evaluates to an AIS Lambda which accepts a set of Number Vectors and returns a proper subset of those same Number Vectors (attempting to pick the fittest in the set).

The Abstract Regression-Classification Machine raises the grammar concepts embodied in Grammatical Evolution far above the WFF genome's grammar annotations, making the understanding and preservation of WFF grammar an essential task of the population operators. Thus ARC operators can understand that two WFF s-expressions cannot be combined because they are grammatically incompatible, that two ARC WFF s-expressions must be combined in specific ways because of their grammar annotations, that two ARC WFF s-expressions are essentially equivalent, or that two ARC WFF s-expressions can be combined and reduced to a more primal form.

Furthermore, the task of fitness improvement is raised far above any requirement for operational purity. Abstract Regression-Classification Machine WFFs can be annotated with a wide variety of information to help the population operators raise the population fitness level. Annotations can aid operators in establishing protected or unprotected sub-populations. Annotations can aid operators in enhancing populations with WFFs generated via exhaustive partial-search, statistical learning methods, etc. In general the ARC uses annotations to extend the style and type of population operator far beyond those listed in any one specialty, while requiring only that the operator be grammar preserving.

Data

The Abstract Regression-Classification Machine learns to select and score the best individuals from a universe of individuals over time. Over a series of discrete time steps, a universe of individuals is collected for each time step (the training data). During prediction the ARC attempts to select the best individuals from the testing data (the ARC is NOT given the "score" values for the testing data). The individuals are Number vectors and represent quantitative information about things such as Stocks, People, Cities, etc. The score values are single Numbers and represent some quantitative value about these same Stocks, People, Cities, etc.

Each individual followed by the system is given a unique identifier which remains unique across all time periods studied (no two individuals ever have the same identifier). Furthermore, each time period studied is given a unique ascending integer index (i.e. week 1, week 2, etc.). So, for a series of time periods, historical information about groups of individuals is collected for each time period. The historical information collected for each individual for each time period is stored in a Number Vector and includes: the time period index; the unique identifier of the individual; and other numeric information about the individual pertinent to the current investigation. Finally, each individual in each time period is given a numeric "score" which determines the value of the individual in that time period. The "best" individuals have the highest "score" values.

Fitness of WFFs

During training, the ARC is given historical information and the "score" values for time periods 0 through T for all individuals. For each time period 0 through T, we can calculate the average score for both the theoretic best and worst selections of N individuals. The ARC currently computes the theoretic best and worst scores for each of the top and bottom percentile partitions (5%, 10%, 15%, ..., 45%, 50%). The order preserving "fitness" of an Estimator WFF is measured by what percentage of the theoretical "best" averages it obtained on each of these top and bottom percentile partitions. In addition, the ARC also computes the least squares error between the predictions and the "score" values for each individual.
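
As a sketch of the order-preservation measure (the exact ARC scaling is not given here, so the normalization below is an assumption), one partition's score can be computed by comparing a WFF's top-K picks against the theoretical best and worst selections:

    # Hypothetical sketch: fraction of the theoretical best average score
    # captured by a WFF's top-K selection for one percentile partition.
    import numpy as np

    def top_fraction(estimates, scores, pct=0.05):
        k = max(1, int(len(scores) * pct))
        picked = scores[np.argsort(-estimates)[:k]].mean()  # WFF's top-K picks
        best = np.sort(scores)[-k:].mean()                  # theoretical best
        worst = np.sort(scores)[:k].mean()                  # theoretical worst
        return (picked - worst) / (best - worst)            # 1.0 = perfect ordering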

During training, each WFF in the population (without knowing the score values) attempts to score the N individuals for time periods 0 through T. Associated with each WFF in the population {w in P} are the least squares error and the order preservation measure for time periods 0 through T. These two fitness measures are normalized into a single fitness score as follows: S[w,t] = g(Error[w,t], Order[w,t]).

The fitness measure for each WFF {F[w] == (S[w,0] * S[w,1] * ... * S[w,t])} is the product of its normalized scores for time periods 0 through T. The fitness measure tends to favor those WFFs which have done consistently well during the training period (as opposed to those with only an occasional big success).
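
Since the normalization function g is not specified above, the following sketch assumes one simple plausible g and shows how the per-period scores multiply into F[w]; the actual g used by ARC may differ.

    # Hypothetical sketch of the combined fitness. The real normalization
    # g(Error, Order) is internal to ARC; a simple form is assumed here.
    import numpy as np

    def g(error, order):
        # Map least-squares error into (0, 1] and blend it with the
        # order-preservation measure (assumed already in [0, 1]).
        return 0.5 * (1.0 / (1.0 + error)) + 0.5 * order

    def fitness(errors_by_period, orders_by_period):
        # F[w] = product of S[w, t] over training periods 0..T, so a WFF
        # must score well in every period to achieve high fitness.
        s = [g(e, o) for e, o in zip(errors_by_period, orders_by_period)]
        return float(np.prod(s))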

Population Operators

The Abstract Regression-Classification Machine applies a set of grammar preserving operators to evolve the population of well-formed-formulas (WFFs) into a population of fitter WFFs. The population operators are grammar preserving such that each operator maps from a population of Estimator WFFs to a population of Estimator WFFs. The understanding and preservation of WFF grammar is an essential task of the population operators. WFFs can be annotated with a wide variety of information to help the population operators raise the population fitness level. Annotations can aid operators in establishing protected or unprotected sub-populations evolved with different grammars. Annotations can aid operators in enhancing populations with WFFs generated via exhaustive partial-search, statistical learning methods, etc.

Here follows an overview of the population operators supported during Abstract Regression-Classification.

The grow operator generates a new WFF, from the empty set, by randomly invoking Estimator grammar rules so as to create a WFF.

The mutate operator generates a new WFF, from a parent WFF, by randomly mutating the parent WFF in a grammar preserving manner.

The children operator generates two new WFFs, from two parent WFFs, by randomly invoking Estimator grammar rules so as to cut and recombine child WFFs in a grammar preserving manner.

The greed operator generates a new WFF, from the empty set, by using greedy search to invoke Estimator grammar rules so as to create a WFF.
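
As a hypothetical illustration of grammar preservation (not the ARC operator code), the sketch below stores WFFs as s-expression-like nested tuples and mutates by regrowing a random subtree with the same grammar, so every offspring remains well formed:

    # Hypothetical sketch: a grammar-preserving mutate operator over
    # s-expression trees. A random subtree is replaced by a freshly grown
    # subtree, so the offspring is always grammatically valid.
    import random

    TERMINALS = ["xtime", "x1", "x2", "x3"]
    UNARY, BINARY = ["sin", "log", "abs"], ["+", "*", "max"]

    def grow(depth=3):
        if depth == 0 or random.random() < 0.4:
            return random.choice(TERMINALS)
        if random.random() < 0.5:
            return (random.choice(UNARY), grow(depth - 1))
        return (random.choice(BINARY), grow(depth - 1), grow(depth - 1))

    def mutate(tree, p=0.2):
        if not isinstance(tree, tuple) or random.random() < p:
            return grow()                      # regrow this subtree
        head, *kids = tree
        return (head, *[mutate(k, p) for k in kids])

    parent = grow()
    child = mutate(parent)                     # still a valid Estimator-style WFF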

Multiple Grammars

The Abstract Regression-Classification Machine supports multiple grammars in evolving the population of well-formed-formula (WFF) problem solutions. All ARC population operators are grammar preserving such that each operator maps from a population of Estimator WFFs to a population of Estimator WFFs, in a grammar preserving manner. The design concept of grammar preservation extends to establishing protected and unprotected sub-populations evolved with different grammars. The understanding and preservation of multiple WFF grammars is an essential task of the Abstract Regression-Classification Machine. The Abstract Regression-Classification Machine certainly attempts to evolve general programs, with evolved functions and variables; but there are also attempts to evolve program solutions using Support Vector, Multiple Regression, and Decision Tree models. In fact the ARC supports the co-evolution of a wide variety of Estimator WFF grammars, providing potentially unique strategies for approaching the problem.

The following Estimator grammars are currently supported during Abstract Regression-Classification.

The REG grammar forms valid Estimator WFFs which train and run a simple linear regression model on the training data set.

The MVL grammar forms valid Estimator WFFs which train and run a multiple linear regression model on the training data set.

The SVM grammar forms valid Estimator WFFs which train and run a support vector regression model with a cubic kernel on the training data set.

The FRM grammar forms valid Estimator WFFs which train and run a multiple factor regression model on the training data set.

Example

An example, but by no means the only example, of X and Y would be a set of vectors of stock market data taken from the Open, High, Low, Close, and Volume numbers for all NASDAQ traded stocks over a 52 week period. In this example we have the following:

xtime  The sequential index of the current week (from 1 through 52).
x1     The unique integer identifier of the stock (stocks not traded would not appear).
x2     The current week opening price.
x3     The current week high price.
x4     The current week low price.
x5     The current week closing price.
x6     The current week share volume traded.
Y      The "score" vector of next week profits (next_week_closing_price - the_current_week_closing_price).
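
As a small sketch of how one weekly record maps onto this layout (the helper function below is invented for illustration), note that Apple's first-week row reproduces the 3.4% score shown in the training table that follows:

    # Hypothetical sketch: packing one week's record for one stock into
    # the x vector layout described above, plus its score y.
    def make_row(week, stock_id, o, h, l, c, vol, next_close):
        x = [week, stock_id, o, h, l, c, vol]   # xtime, x1..x6
        y = (next_close - c) / c                # next week's profit
        return x, y

    x, y = make_row(week=0, stock_id=1, o=23.45, h=25.67, l=23.35,
                    c=24.56, vol=19367, next_close=25.40)
    print(y)                                    # approximately 0.034, i.e. 3.4%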

Similar examples can be constructed for oil exploration data over time, for the height and weight of individuals over time, etc. However, continuing with our securities example, we train our machine on the market data for four stocks over a four week period (three weeks of training data plus one week of testing data) as follows:

Training Data

    Time     Stock       Open      High      Low       Close     Vol       Score
    x.xtime  x.x1        x.x2      x.x3      x.x4      x.x5      x.x6      y

    (first week)                              Note: Score is next week's profit
    0        1 (Apple)   $23.45    $25.67    $23.35    $24.56    19367     3.4%
    0        2 (IBM)     $143.45   $145.27   $143.15   $144.96   894676    -1.2%
    0        3 (Xerox)   $13.95    $15.27    $13.35    $14.72    56832     4.8%
    0        4 (GM)      $57.15    $62.17    $53.65    $62.05    3419647   9.1%

    (second week)
    1        1 (Apple)   $24.56    $25.38    $22.75    $25.40    12046     1.2%
    1        2 (IBM)     $144.96   $144.96   $143.15   $143.23   864023    -3.2%
    1        3 (Xerox)   $14.72    $16.12    $14.39    $15.43    59204     3.4%
    1        4 (GM)      $62.05    $62.05    $68.00    $67.70    3219382   6.5%

    (third week)
    2        1 (Apple)   $25.40    $26.98    $24.75    $25.71    22056     0.8%
    2        2 (IBM)     $143.23   $143.23   $136.75   $138.64   824093    -4.3%
    2        3 (Xerox)   $15.43    $16.45    $15.09    $15.96    61205     -1.4%
    2        4 (GM)      $67.70    $75.35    $66.39    $72.10    3619582   7.8%

We train the ARC on the training data shown above. After training, we show the machine the following testing data, TX, and ask it to return an estimate of the next week's profit, TY, for each of the four individuals. We do not show the machine the scores, TY.

Testing Data

    Time      Stock       Open      High      Low       Close     Vol       Score
    tx.xtime  tx.x1       tx.x2     tx.x3     tx.x4     tx.x5     tx.x6     ty

    (fourth week)                              Note: Score is next week's profit
    3         1 (Apple)   $25.71    $26.18    $25.55    $25.92    25046     -1.2%
    3         2 (IBM)     $138.64   $139.23   $131.15   $132.67   774593    -6.1%
    3         3 (Xerox)   $15.96    $16.13    $15.00    $15.73    59205     2.4%
    3         4 (GM)      $72.10    $77.87    $71.39    $77.73    3710582   5.8%

Resulting Estimates

After testing, the learning machine returns the following estimated scores, EY.

    Estimate  Score
    ey        ty

    -2.3%     -1.2%
    -7.6%     -6.1%
    1.9%      2.4%
    4.9%      5.8%

We calculate the least squares error on the four individuals as 1.13%, so we did only a mediocre job of minimizing the least squares error; however, we did a perfect job of preserving the sort order of the corresponding TY values.
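
The 1.13% figure and the ordering claim can be checked directly from the tables above:

    # Check of the example above: mean squared error between the estimates
    # EY and the true scores TY, and a rank-order comparison.
    ey = [-2.3, -7.6, 1.9, 4.9]
    ty = [-1.2, -6.1, 2.4, 5.8]
    mse = sum((e - t) ** 2 for e, t in zip(ey, ty)) / len(ey)
    print(mse)                       # 1.13 (in squared percentage points)
    same_order = (sorted(range(4), key=lambda i: ey[i]) ==
                  sorted(range(4), key=lambda i: ty[i]))
    print(same_order)                # True: the sort order is perfectly preserved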

FAQ

Question 1: Why must we train the learning machine on collections of individuals in each time period? Why not simply train the machine on each individual separately and perform a normal regression estimate for each individual?

Answer: Many real world estimates cannot be made unless one is aware of the competitive landscape.

For instance, suppose one is estimating the social popularity of students in the senior high school class. We can perform any number of individual regressions correlating high scores for intelligence, appearance, and social skills with social popularity. However, all of these individual regression models are greatly skewed in the case where all students in the senior high school class are male except one student who is female.

Also, suppose one is estimating the financial popularity of our Bank's Certificates of Deposit. We perform any number of individual regressions correlating our Bank's previous Certificates of Deposit with their financial popularity. However, all of these individual regression models are greatly skewed in the case where one of our competitors is advertising an aggressive interest rate two percentage points higher than ours.

Question 2: Why must we ask the learning machine to preserve the natural order of the individuals in the testing time period? Why not simply have the machine provide more accurate regression estimates for each individual in the testing time period?

Answer: Many real world estimates cannot be made for all individuals, but only for a few individuals.

For instance, suppose one is estimating currency conversion rates. Normally these rates have very small random daily changes. However, every so often, a central bank will pre-announce its intention to buy or sell its own currency. In those special cases the learning machine will want to provide its normal estimates on most currencies; yet, make a special higher than normal estimate for the currency whose central bank has pre-announced.

Many real world situations do not allow us to accurately estimate the score of the best individuals, but only to guess which individuals might be better than others.

For instance, suppose ten monkeys and one five-year-old human child are taking a simple IQ test (this being all the information we have). We cannot, with the meager information provided, accurately estimate the IQ scores of the contestants after they take the IQ test. However, we can reasonably guess that the human child will have the better IQ score (whatever that score may be).