|
25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more... |
Home Projects Up Sitemap Search Blog Forum+Chat About Us Privacy Terms of Use Feedback FAQ Images Services Payments Humor Music |
C++ class for descriptive statistics, linear regression using best-fit data transformation, correlation coefficient, Student's t significance test, and prediction (forecasting)Tested with Microsoft Visual C++ 2008 and GNU GCC g++ 4.3, SStatArray (statistics calculating array) has functions to:
The class stores (accumulates) the input numbers or ordered pairs (independent and dependent variables) internally in a vector, so new values can be added at any time. Stats are automatically recalculated when it's necessary. There are overloaded operators that allow an instance of the class to be used as though it is a double or complex<double> variable. When used this way, it is a variable that keeps a history of its own prior values and can calculate statistics about the sequence of values it has held. The functions included in this class can be done more easily and more thoroughly in a spreadsheet program like Excel or Calc for a one-time data analysis where a person reviews the output and makes decisions based on the results. By contrast, this class allows doing statistical calculations and significance testing in an automated way as part of a larger C++ program that, unassisted, makes decisions based on the results. One possible application is for analyzing large volumes of data and only saving or reporting results that show promise of statistical significance (something like data mining). Another application might be in artificial intelligence. For example:
This SStatArray class is an expanded and improved descendant of my earlier Borland C++ version, and it incorporates the mode calculation method from my otherwise less capable JavaScript statistics function written for my online statistics calculator. Some differences in the new C++ version:
RequirementsIn addition to the standard header files, this program requires:
At the bottom of this page is a utility program that uses an SStatArray to read data from a file and report the stats to the console. The following sample output uses a format in which each set of numbers in parentheses is a complex<double>(X,Y) with values for the X array in the real part and values for the Y array in imag: |
1 20 <- the input data pairs 2 22.7726 3 24.3944 4 25.5452 5 26.4378 6 27.167 7 27.7836 8 28.3178 9 28.7889 Count 9 Low (1,20) High (9,28.7889) Range (8,8.7889) Sum (45,231.207) ArithMean (5,25.6897) GeoMean (4.14717,25.536) HarMean (3.18137,25.3716) Median (5,26.4378) StDevPop (2.58199,2.71251) VariancePop (6.66667,7.3577) StDevEst (2.73861,2.87705) VarianceEst (7.5,8.27741) Mode Freq. (0,0) XMode(s) YMode(s) Curve (+Best Fit) YInt(b) Slope(m) Beta(m) R2 StdErrEst T(of R2) T(Slope) TRatioR2 TRatioSlope 0: Linear y = b + mx 20.6738 1.00319 0.954919 0.911871 0.805251 8.5105 8.5105 6.01449 6.01449 +1: Logarithmic y = b + mLN(x) 20 4 1 1 3.26646e-005 219706 219706 155269 155269 2: Parabolic y = b + mxx 22.8664 0.0891554 0.870164 0.757186 1.33662 4.67211 4.67211 3.30185 3.30185 3: Hyperbolic y = b + m/x 28.7088 -9.60482 -0.951993 0.90629 0.830354 8.22793 8.22793 5.81479 5.81479 4: Inverse Linear y = 1 / (b + mx) 0.0476802 -0.00165321 -0.918017 0.842755 0.00184383 6.12506 6.12506 4.32867 4.32867 5: Exponential y = be^(mx) 20.8505 0.0405428 0.937791 0.879452 0.0387563 7.14619 7.14619 5.05031 5.05031 6: Power y = bx^m 20.2124 0.164364 0.998517 0.997037 0.00607664 48.5292 48.5292 34.2962 34.2962 Predictions: #0: 9 -> 28.7889 #1: 10 -> 29.2103 #2: 11 -> 29.5916 #3: 12 -> 29.9396 #4: 13 -> 30.2598 #5: 14 -> 30.5562 #6: 15 -> 30.8322 #7: 16 -> 31.0904 #8: 17 -> 31.3329 #9: 18 -> 31.5615 #10: 19 -> 31.7778 #11: 20 -> 31.9829 #12: 21 -> 32.1781 #13: 22 -> 32.3642 #14: 23 -> 32.542 #15: 24 -> 32.7122 #16: 25 -> 32.8755 #17: 26 -> 33.0324 #18: 27 -> 33.1833 #19: 28 -> 33.3288 #20: 29 -> 33.4692 #21: 30 -> 33.6048 #22: 31 -> 33.7359 #23: 32 -> 33.8629 Press <Enter> to close the window...
/* stataray.h 1-23-2010 For Microsoft Visual C++ or GNU g++ (GCC)
Copyright (C)1997-2001, 2006, 2009-10 Steven Whitney.
Initially published by http://25yearsofprogramming.com
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
Version 3 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
--Holds a 1 or 2 dimensional array of data points, and can return statistics about the data.
(When only 1 dimension (Y) is supplied, it auto-generates corresponding X values starting at 1.)
--X is independent variable, Y is dependent.
--Functions that return complex: x result is in real, y result in imag.
--All the Y Intercept calculations look ok when compared to a graph (TIPPv35.DAT)
--For all curves, the reported m b values are the right ones to plug back into the untransformed form of the equation
--For the few I checked in Excel, plotting raw data on graphs with transformed axes really does turn the curve into a line.
02-01-2009: added fns to return Variances and RSquared values.
02-02-2009: added calc of Standard Error of Estimate and output of t-score and t-ratio of RSquared
02-20-2009: added fn ConvertToZScores()
02-21-2009: added dStdBetaWeight variables, calculations, and retrieval fns; TStatisticOfCorr(i);
02-25-2009: revised Standard Error of Estimate calculations, split into Pop and Est;
added and tested Standard Error of Slope, t statistic and ratio for slope
01-23-2010: Minor changes to make it standard C++ and GNU g++ compatible:
Changed all System::Double::NaN to std::numeric_limits<double>::quiet_NaN()
1 other change, noted in ReadData()
------
TO DO:
--need skewness and kurtosis calculations. See TI Applied Statistics p.4-2, which is more understandable than Wikipedia.
--maybe allow creating an array of predicted values for comparison against the real ones.
--would a sine transform catch cycles?
--a variant of this might be useful as a general purpose "state correlator and
predictor", e.g. to help determine cause & effect or probabilities of outcomes.
you'd code states numerically. Example in chess: given this board state, my
opponent usually does this... More generally, when this state exists, that one
usually follows. (see complex.doc for possible related discussion, and for
related discussion of pattern-finding: "increasing", "decreasing" can be
determined from slope, etc.).
------
USES:
--May be useful in applications such as Wadapt.cpp, where you want to make
data accumulation and statistical calculations an easily-accomplished
"functional subunit". When used in place of a regular variable, it gives an
agent a historical record of the values of the variable, and the inherent
ability to predict future values, detect trends, etc.
--The purpose of the Chi Squared feature is to allow an intelligent agent to answer: "Is this
occurrence normal (statistically likely) or abnormal (statistically unlikely) and
therefore interesting and worth paying attention to?"
------
NOTES:
--For t-test min df=1, so min Count==3,
--Question: Why is the normalized slope (beta weight) not the same as the original slope?
Answer: Because in addition to the translation such that the arrays are centered on (0,0),
X and Y also become standardized to normal scores, which effectively changes the scaling of the axes relative to each other (squishes them)
and changes the positions of all the points. If the original distributions of X and Y have identical means and stddev's,
the normalized slope will be the same.
--The Student's t values and ratios for the Correlation Coefficient and the Slope are always the same.
I think it is because the only connection between the X and Y arrays IS the slope. Thus, it must account for 100% of the correlation.
Early on, I suspected they should be the same; it is a relief that they turn out to be so.
--There is a summary of linear regression formulas at http://www.gseis.ucla.edu/courses/ed231a1/notes2/slr0.html
*/
#ifndef __STATARAY_H
#define __STATARAY_H
#include <complex>
#include <vector>
#include <algorithm> // for sort()
#include <functional> // for greater<double> or less<double>
#include "my.h" // you must use the corresponding VC++ or g++ version of my.h
using namespace std;
//////////////////////////////////////////////////////////////////////////////
// an SStatArray holds a data set, and can provide statistics about it.
class SStatArray
{
public:
SStatArray();
SStatArray(SStatArray& other, // copy all or a subset of the source's data
double lowx = -numeric_limits<double>::max(), double highx = numeric_limits<double>::max(),
double lowy = -numeric_limits<double>::max(), double highy = numeric_limits<double>::max());
SStatArray& operator = (const SStatArray& other); // create copy
double operator = (double d); // add # to array
complex<double> operator = (complex<double> c); // add # pair to array
complex<double>& operator [](uint loc); // unsure whether it's desirable to allow this
bool operator == (const SStatArray& other) const;
operator complex<double> (); // allow use as a complex
operator double (); // allow use as a double
friend ostream& operator << (ostream& os, SStatArray& s); // write stats
friend istream& operator >> (istream& is, SStatArray& s); // append array data
ostream& WriteData(ostream& os); // write array data
void WriteData(const string& filename); // write array data to file
void ReadData(const string& filename); // read array data from a file
int Calc(); // force (re)calculating all statistics
int mCalc() { return(changed ? Calc() : 1); } // recalc only if data changed
// THESE VALUES ARE ALWAYS FOR THE RAW DATA
uint Count() const { return Dataset.size(); }
uint size() const { return Count(); } // synonym for Count()
complex<double> High() { mCalc(); return(dHigh); }
complex<double> Low() { mCalc(); return(dLow); }
complex<double> Last();
complex<double> Range() { mCalc(); return(dRange); }
complex<double> Median() { mCalc(); return(dMedian); }
complex<double> GeoMean() { mCalc(); return(dGeoMean); }
complex<double> HarMean() { mCalc(); return(dHarMean); }
// it's really up to the caller to first ensure there's at least one mode to get,
// then get them with for(uint i = 0 ; i < ModesCountXY() ; i++) v = ModeXY(i);
uint ModesCountX() { mCalc(); return(dModesX.size()); } // number of modal values
uint ModesCountY() { mCalc(); return(dModesY.size()); }
uint ModeFreqX() { mCalc(); return(dModeFreqX); } // frequency of modal value(s)
uint ModeFreqY() { mCalc(); return(dModeFreqY); }
double ModeX(uint i = 0) { mCalc(); return((dModesX.size() > i) ? dModesX[i] : std::numeric_limits<double>::quiet_NaN()); }
double ModeY(uint i = 0) { mCalc(); return((dModesY.size() > i) ? dModesY[i] : std::numeric_limits<double>::quiet_NaN()); }
// YOU USUALLY WANT THESE FOR THE LINEAR MODEL (RAW DATA, i = 0), BUT CAN GET THE OTHERS
complex<double> Sum(int i = 0) { mCalc(); return(dSum[i]); }
complex<double> Mean(int i = 0) { mCalc(); return(dMean[i]); }
complex<double> StDevEst(int i = 0) { mCalc(); return(dStDevEst[i]); }
complex<double> StDevPop(int i = 0) { mCalc(); return(dStDevPop[i]); }
complex<double> VarianceEst(int i = 0) { mCalc(); return complex<double>(real(dStDevEst[i]) * real(dStDevEst[i]), imag(dStDevEst[i]) * imag(dStDevEst[i])); }
complex<double> VariancePop(int i = 0) { mCalc(); return complex<double>(real(dStDevPop[i]) * real(dStDevPop[i]), imag(dStDevPop[i]) * imag(dStDevPop[i])); }
// ADDITIONAL VERSIONS OF THE ABOVE, RETURNING RESULT FOR ONLY X OR Y
double HighX() { return(real(High())); }
double LowX() { return(real(Low())); }
double LastX() { return(real(Last())); }
double RangeX() { return(real(Range())); }
double MedianX() { return(real(Median())); }
double GeoMeanX() { return(real(GeoMean())); }
double HarMeanX() { return(real(HarMean())); }
double SumX(int i = 0) { return(real(Sum(i))); }
double MeanX(int i = 0) { return(real(Mean(i))); }
double StDevEstX(int i = 0) { return(real(StDevEst(i))); }
double StDevPopX(int i = 0) { return(real(StDevPop(i))); }
double VarianceEstX(int i = 0) { return(real(VarianceEst(i))); }
double VariancePopX(int i = 0) { return(real(VariancePop(i))); }
double HighY() { return(imag(High())); }
double LowY() { return(imag(Low())); }
double LastY() { return(imag(Last())); }
double RangeY() { return(imag(Range())); }
double MedianY() { return(imag(Median())); }
double GeoMeanY() { return(imag(GeoMean())); }
double HarMeanY() { return(imag(HarMean())); }
double SumY(int i = 0) { return(imag(Sum(i))); }
double MeanY(int i = 0) { return(imag(Mean(i))); }
double StDevEstY(int i = 0) { return(imag(StDevEst(i))); }
double StDevPopY(int i = 0) { return(imag(StDevPop(i))); }
double VarianceEstY(int i = 0) { return(imag(VarianceEst(i))); }
double VariancePopY(int i = 0) { return(imag(VariancePop(i))); }
// YOU USUALLY WANT THESE FOR THE BESTFIT MODEL (i = -1), BUT CAN GET THE OTHERS
double Corr(int i = -1); // Correlation coefficient
double RSquared(int i = -1); // returns Corr()*Corr(), coefficient of determination, % of Y variance attributable to changes in X
double TStatisticOfCorr(int i = -1); // Student's t statistic appropriate for assessing significance of Pearson's correlation coefficient
double TCorrSignifRatio(int i = -1); // Ratio of t to the critical value of t for current df
double StdErrOfEstimateYPop(int i = -1);// Population (biased, N-weighted) Standard Error of Estimate of Y predicted from X
double StdErrOfEstimateYEst(int i = -1);// Estimate(unbiased, (N-2)), without calculating residuals
double StandardErrorY(int i = -1); // calculate Y residuals, store in ResidualsY, and return unbiased (N-2 weighting) Standard Error of Estimate
double StdBetaWeight(int i = -1); // multiple regression version of Slope(), where X and Y arrays are z-scores; YIntercept is always 0.
double TStatisticOfSlope(int i = -1); // Students t for assessing whether slope is significantly different from zero
double TSlopeSignifRatio(int i = -1); // Ratio of t to the critical value for the SLOPE for the current df
double StdErrorOfSlope(int i = -1); // Standard Error of the Slope (determines upper and lower limits of true slope, within confidence)
double Slope(int i = -1);
double YIntercept(int i = -1);
double XIntercept(int i = -1);
double PredictX(double y, int i = -1);
double PredictY(double x, int i = -1);
complex<double> Next(int i = -1); // return prediction of the next point (x,y)
string GetCurveText(int i = -1); // returns text description of curve, with its equation
int GetBestFit() { mCalc(); return(BestFit); }
uint GetConfidenceLevel() { return dConfidenceLevel; } // returns current level
uint SetConfidenceLevel(uint newlevel); // only allows setting to legal levels
// OTHER FUNCTIONS (FUNCTIONS TAILORED FOR PARTICULAR APPLICATIONS WILL GO HERE)
complex<double> ModeYForXBetween(double lowx = -numeric_limits<double>::max(), double highx = numeric_limits<double>::max());
bool ConvertToTimeSeries();
bool ConvertFromTimeSeries();
bool ConvertToZScores(bool convertx = true, bool converty = true);
complex<double> ChiSq();
bool ChiSqIsSignif(bool useYates = false);
double GetC5();
double GetC10();
// overridden array functions that can change the array data, plus some alternate versions
void Add(complex<double> c);
void Add(double x, double y);
void Add(double y);
// special-purpose: for string description & analysis
int Add(const string& s); // adds all chars, to length()
int Add(const char* s); // adds all chars, to 0 terminator
void Flush(); // empty Dataset
// variables
enum { LINE, LOG, PARA, HYPER, INVLINE, EXP, POW, QUAD }; // transformation types
bool OutputAllStats; // if false, << writes only data for BestFit
bool changed; // array data has been changed since the last stats calc
vector<double> ResidualsY; // (Y - Predicted Y') for each X
protected:
void Summate(int,double,double); // utility function tallies dSum, dSum2, dSumXY
uint dConfidenceLevel; // determines critical values thresholds for all stats tests
vector< complex<double> > Dataset; // the untransformed raw data points (x,y) = (independent,dependent)
int BestFit; // which curve best fits the data
bool AllowGHMeanX; // any input <= 0 turns these false
bool AllowGHMeanY;
// not used in regression calcs. 1 of each is sufficient. (describes raw data only)
complex<double> dHigh;
complex<double> dLow;
complex<double> dRange;
complex<double> dMedian;
complex<double> dGeoMean;
complex<double> dHarMean;
uint dModeFreqX; // frequency of modal value(s)
uint dModeFreqY;
vector<double> dModesX; // vectors allow for > 1 mode
vector<double> dModesY;
// used for regression, one required for each type of curve (data transformation) handled.
// each array index corresponds to the data needed for one type of curve,
// so for linear you use all the [0] elements.
// Some of the arrays contain duplicate data, but the ease of use is worth it.
bool Allow[8]; // whether calculation for this curve can be allowed
complex<double> dSum[8]; // sums of the transformed x and y inputs
complex<double> dSum2[8]; // sums of the squares of the transformed x and y inputs
complex<double> dMean[8]; // means of the transformed x and y inputs
complex<double> dStDevEst[8]; // N-1 StDev of the transformed x and y inputs
complex<double> dStDevPop[8]; // N StDev of the transformed x and y inputs
double dSumXY[8]; // sums of the products XY (after transformation)
double dSlope[8]; // slopes of the curves (after their transformations into lines)
double dYIntercept[8]; // (transformed) Y intercepts of the transformed lines
double dCorr[8]; // correlations between X and Y arrays (used to determine best fit)
double dTStatisticOfCorr[8]; // Student's t calculated for each transformation
double dTCorrSignifRatio[8]; // Ratio of t to the critical value of t for current df, > 1.0 is significant, magnitude indicates degree of excess signif.
double dTStatisticOfSlope[8]; // Student's t value calculated on dSlope (not the standardized beta weight because its Standard Error is not available);
// to get t for the beta weight, convert X and Y to z-scores first so that YInt is zero and dSlope IS the beta weight.
double dTSlopeSignifRatio[8]; // Ratio of t to the critical value for current df, > 1.0 is significant, magnitude indicates degree of excess signif.
double dStdErrorOfSlope[8]; // Standard Error of the Slope (determines limits between which the true slope falls, with a given confidence)
double dStdBetaWeight[8]; // regression slope obtained when all X and Y values are converted to z-scores within their distributions;
// since both distributions are centered at (0,0), the YIntercept is always 0 in theory and within about 1e-15 in practice;
// used in multiple linear regression.
double dStdErrOfEstimateYPop[8];// Standard Error of Estimate of Y on X (Y predicted from X), for each model. Population value (biased). Syx
// its validity assumes homoscedasticity: same Y variance across the range of X values;
double dStdErrOfEstimateYEst[8];// Standard Error of Estimate of Y on X (Y predicted from X), for each model. Estimated value (unbiased).
static double C5[30]; // Chi-Squared critical value lookup tables
static double C10[30]; // only 1 copy of each array, used by all class instances
static double TCrit900[101]; // Student's t critical value lookup tables; each entry corresponds to df = its index + 1
static double TCrit950[101]; // So TCrit[0] corresponds to df = 1
static double TCrit975[101]; // Contrary to convention, these arrays are named by their confidence levels,
static double TCrit990[101]; // TCrit990 = 99.0%, so .01 (1%) of the area is above the critical value.
static double TCrit995[101];
static double TCrit999[101];
};
#endif // __stataray_h
/* stataray.cpp 1-23-2010 For Microsoft Visual C++ or GNU g++ (GCC)
Copyright (C)1997-2000, 2006, 2009-10 Steven Whitney.
Initially published by http://25yearsofprogramming.com
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
Version 3 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
--Code for class SStatArray
--#include this file AFTER my.h and mylib.cpp
*/
#include "stataray.h"
using namespace std;
//////////////////////////////////////////////////////////////////////////////
//----------------------------------------------------------------------------
// External declarations and initializations for these 2 big arrays.
// comparison values for Chi-Squared testing when degrees of freedom <= 30.
// 1-22-2009 Table values compare reasonably well with an online calculator.
//
// C5: Q(x) = 5% If your X is ABOVE this, it is significantly different from
// the normally distributed population
//
// After testing to compare with book examples, should use C10,
// except I don't know the formulas to calculate C10 when dF > 30, etc.
double SStatArray::C5[30] =
{
3.841455338, 5.991476357, 7.814724703, 9.487728465, 11.07048257, 12.59157742,
14.06712726, 15.50731249, 16.91896016, 18.30702904, 19.67515307, 21.02605538,
22.36202656, 23.68478234, 24.99579669, 26.29622094, 27.58710028, 28.86932099,
30.14350506, 31.41042038, 32.67055801, 33.92445983, 35.17246022, 36.41502646,
37.6524894 , 38.88512964, 40.11326562, 41.33715169, 42.55694751, 43.77295389
};
//----------------------------------------------------------------------------
// C10: Q(x) = 10% If your X is ABOVE this, it is significantly different from
// the normally distributed population
double SStatArray::C10[30] =
{
2.705540585, 4.605176125, 6.251394452, 7.779433964, 9.236349099, 10.64463748,
12.01703137, 13.36156189, 14.68366318, 15.98717471, 17.2750067 , 18.54934022,
19.81193273, 21.0641406 , 22.30712056, 23.54182147, 24.76902818, 25.9894184 ,
27.20356481, 28.41196991, 29.61508586, 30.8132853 , 32.00689016, 33.19623509,
34.38158333, 35.56316367, 36.74122756, 37.91590744, 39.08747534, 40.25601698
};
//----------------------------------------------------------------------------
// Student's t critical value lookup tables; each entry corresponds to df = its index + 1. Ex: TCrit[0] corresponds to df = 1.
// Contrary to convention, these arrays are named by their confidence levels rather than area above the critical value.
// TCrit990 = 99.0, so .01 (1%) of the area is above the critical value.
// For a 1-tailed test, the entire Ho rejection region is in one tail.
// For a 1-tailed test at 90% confidence, the entire 10% is in 1 tail; test for (t > critical value in TCrit900).
// For a 2-tailed test at 90% confidence, 5% is in each tail; test for (abs(t) > critical value in TCrit950).
// Now that I've switched to the TI59 t calculation (which is always positive) the abs() calculation is not necessary,
// but you still use TCrit950; the chances that t will land in that region are doubled because abs() was done when calculating t.
// TCrit[100] (101th element) in each array is for df=infinity, which is for the standard normal curve, used for all df > 100.
// These tables are from http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm
// I wanted to calculate P(t) instead of using tables, but it is difficult, requiring Beta or Gamma functions (see http://www.rskey.org/gamma.htm)
// The TI59 program for t (Applied Statistics ST-21) does have somewhere in it the calculations for the Gamma function and the CDF series expansion
// approximation of t, so they can't be so difficult that they're impossible. They fit in the calculator program. (See ST-21 p. 6-12 and 6-14).
double SStatArray::TCrit900[101] =
{
3.078, 1.886, 1.638, 1.533, 1.476, 1.440, 1.415, 1.397, 1.383, 1.372,
1.363, 1.356, 1.350, 1.345, 1.341, 1.337, 1.333, 1.330, 1.328, 1.325,
1.323, 1.321, 1.319, 1.318, 1.316, 1.315, 1.314, 1.313, 1.311, 1.310,
1.309, 1.309, 1.308, 1.307, 1.306, 1.306, 1.305, 1.304, 1.304, 1.303,
1.303, 1.302, 1.302, 1.301, 1.301, 1.300, 1.300, 1.299, 1.299, 1.299,
1.298, 1.298, 1.298, 1.297, 1.297, 1.297, 1.297, 1.296, 1.296, 1.296,
1.296, 1.295, 1.295, 1.295, 1.295, 1.295, 1.294, 1.294, 1.294, 1.294,
1.294, 1.293, 1.293, 1.293, 1.293, 1.293, 1.293, 1.292, 1.292, 1.292,
1.292, 1.292, 1.292, 1.292, 1.292, 1.291, 1.291, 1.291, 1.291, 1.291,
1.291, 1.291, 1.291, 1.291, 1.291, 1.290, 1.290, 1.290, 1.290, 1.290, 1.282
};
//----------------------------------------------------------------------------
double SStatArray::TCrit950[101] =
{
6.314, 2.920, 2.353, 2.132, 2.015, 1.943, 1.895, 1.860, 1.833, 1.812,
1.796, 1.782, 1.771, 1.761, 1.753, 1.746, 1.740, 1.734, 1.729, 1.725,
1.721, 1.717, 1.714, 1.711, 1.708, 1.706, 1.703, 1.701, 1.699, 1.697,
1.696, 1.694, 1.692, 1.691, 1.690, 1.688, 1.687, 1.686, 1.685, 1.684,
1.683, 1.682, 1.681, 1.680, 1.679, 1.679, 1.678, 1.677, 1.677, 1.676,
1.675, 1.675, 1.674, 1.674, 1.673, 1.673, 1.672, 1.672, 1.671, 1.671,
1.670, 1.670, 1.669, 1.669, 1.669, 1.668, 1.668, 1.668, 1.667, 1.667,
1.667, 1.666, 1.666, 1.666, 1.665, 1.665, 1.665, 1.665, 1.664, 1.664,
1.664, 1.664, 1.663, 1.663, 1.663, 1.663, 1.663, 1.662, 1.662, 1.662,
1.662, 1.662, 1.661, 1.661, 1.661, 1.661, 1.661, 1.661, 1.660, 1.660, 1.645
};
//----------------------------------------------------------------------------
double SStatArray::TCrit975[101] =
{
12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
2.040, 2.037, 2.035, 2.032, 2.030, 2.028, 2.026, 2.024, 2.023, 2.021,
2.020, 2.018, 2.017, 2.015, 2.014, 2.013, 2.012, 2.011, 2.010, 2.009,
2.008, 2.007, 2.006, 2.005, 2.004, 2.003, 2.002, 2.002, 2.001, 2.000,
2.000, 1.999, 1.998, 1.998, 1.997, 1.997, 1.996, 1.995, 1.995, 1.994,
1.994, 1.993, 1.993, 1.993, 1.992, 1.992, 1.991, 1.991, 1.990, 1.990,
1.990, 1.989, 1.989, 1.989, 1.988, 1.988, 1.988, 1.987, 1.987, 1.987,
1.986, 1.986, 1.986, 1.986, 1.985, 1.985, 1.985, 1.984, 1.984, 1.984, 1.960
};
//----------------------------------------------------------------------------
double SStatArray::TCrit990[101] =
{
31.821, 6.965, 4.541, 3.747, 3.365, 3.143, 2.998, 2.896, 2.821, 2.764,
2.718, 2.681, 2.650, 2.624, 2.602, 2.583, 2.567, 2.552, 2.539, 2.528,
2.518, 2.508, 2.500, 2.492, 2.485, 2.479, 2.473, 2.467, 2.462, 2.457,
2.453, 2.449, 2.445, 2.441, 2.438, 2.434, 2.431, 2.429, 2.426, 2.423,
2.421, 2.418, 2.416, 2.414, 2.412, 2.410, 2.408, 2.407, 2.405, 2.403,
2.402, 2.400, 2.399, 2.397, 2.396, 2.395, 2.394, 2.392, 2.391, 2.390,
2.389, 2.388, 2.387, 2.386, 2.385, 2.384, 2.383, 2.382, 2.382, 2.381,
2.380, 2.379, 2.379, 2.378, 2.377, 2.376, 2.376, 2.375, 2.374, 2.374,
2.373, 2.373, 2.372, 2.372, 2.371, 2.370, 2.370, 2.369, 2.369, 2.368,
2.368, 2.368, 2.367, 2.367, 2.366, 2.366, 2.365, 2.365, 2.365, 2.364, 2.326
};
//----------------------------------------------------------------------------
double SStatArray::TCrit995[101] =
{
63.657, 9.925, 5.841, 4.604, 4.032, 3.707, 3.499, 3.355, 3.250, 3.169,
3.106, 3.055, 3.012, 2.977, 2.947, 2.921, 2.898, 2.878, 2.861, 2.845,
2.831, 2.819, 2.807, 2.797, 2.787, 2.779, 2.771, 2.763, 2.756, 2.750,
2.744, 2.738, 2.733, 2.728, 2.724, 2.719, 2.715, 2.712, 2.708, 2.704,
2.701, 2.698, 2.695, 2.692, 2.690, 2.687, 2.685, 2.682, 2.680, 2.678,
2.676, 2.674, 2.672, 2.670, 2.668, 2.667, 2.665, 2.663, 2.662, 2.660,
2.659, 2.657, 2.656, 2.655, 2.654, 2.652, 2.651, 2.650, 2.649, 2.648,
2.647, 2.646, 2.645, 2.644, 2.643, 2.642, 2.641, 2.640, 2.640, 2.639,
2.638, 2.637, 2.636, 2.636, 2.635, 2.634, 2.634, 2.633, 2.632, 2.632,
2.631, 2.630, 2.630, 2.629, 2.629, 2.628, 2.627, 2.627, 2.626, 2.626, 2.576
};
//----------------------------------------------------------------------------
double SStatArray::TCrit999[101] =
{
318.313, 22.327, 10.215, 7.173, 5.893, 5.208, 4.782, 4.499, 4.296, 4.143,
4.024, 3.929, 3.852, 3.787, 3.733, 3.686, 3.646, 3.610, 3.579, 3.552,
3.527, 3.505, 3.485, 3.467, 3.450, 3.435, 3.421, 3.408, 3.396, 3.385,
3.375, 3.365, 3.356, 3.348, 3.340, 3.333, 3.326, 3.319, 3.313, 3.307,
3.301, 3.296, 3.291, 3.286, 3.281, 3.277, 3.273, 3.269, 3.265, 3.261,
3.258, 3.255, 3.251, 3.248, 3.245, 3.242, 3.239, 3.237, 3.234, 3.232,
3.229, 3.227, 3.225, 3.223, 3.220, 3.218, 3.216, 3.214, 3.213, 3.211,
3.209, 3.207, 3.206, 3.204, 3.202, 3.201, 3.199, 3.198, 3.197, 3.195,
3.194, 3.193, 3.191, 3.190, 3.189, 3.188, 3.187, 3.185, 3.184, 3.183,
3.182, 3.181, 3.180, 3.179, 3.178, 3.177, 3.176, 3.175, 3.175, 3.174, 3.090
};
//----------------------------------------------------------------------------
// constructor
SStatArray::SStatArray()
{
changed = true;
OutputAllStats = true;
BestFit = 0;
dConfidenceLevel = 900; // 90% level is default
dHigh = dLow = dRange = dMedian = dGeoMean = dHarMean = complex<double>(0,0);
uint i;
for(i = 0 ; i < 8 ; i++)
{
dSum[i] = dSum2[i] = dMean[i] = dStDevEst[i] = dStDevPop[i] = complex<double>(0,0);
dSumXY[i] = dSlope[i] = dYIntercept[i] = dCorr[i] = dStdBetaWeight[i] = dTCorrSignifRatio[i] = 0.;
dStdErrOfEstimateYPop[i] = dStdErrOfEstimateYEst[i] = dTStatisticOfSlope[i] = dTSlopeSignifRatio[i] = 0.;
dStdErrorOfSlope[i] = numeric_limits<double>::max(); // for an error, flag invalid value as a huge value
Allow[i] = true;
}
Allow[QUAD] = false; // quadratic is a special case of polynomial, but I can't figure out how to do it
} // constructor
//----------------------------------------------------------------------------
// copy constructor
// you can use it to copy all or only a subset of the source object's data
SStatArray::SStatArray(SStatArray& other, double lowx, double highx, double lowy, double highy)
{
changed = true;
OutputAllStats = other.OutputAllStats;
BestFit = 0;
dConfidenceLevel = other.dConfidenceLevel;
if(lowx > highx)
swap(lowx,highx);
if(lowy > highy)
swap(lowy,highy);
uint N = other.Count();
for(uint i = 0 ; i < N ; i++)
{
double x = real(other.Dataset[i]), y = imag(other.Dataset[i]);
// test INCLUDES both endpoints.
// And the point is only copied if BOTH its X and Y values are within their ranges.
if((x >= lowx) && (x <= highx) && (y >= lowy) && (y <= highy))
Add(other.Dataset[i]);
}
} // copy constructor
//----------------------------------------------------------------------------
bool SStatArray::operator == (const SStatArray& other) const
{
return(&other == this);
}
//----------------------------------------------------------------------------
// assignment operator creates an exact copy
SStatArray& SStatArray::operator = (const SStatArray& other)
{
if(this == &other)
return(*this);
changed = true;
BestFit = 0;
OutputAllStats = other.OutputAllStats;
dConfidenceLevel = other.dConfidenceLevel;
for(uint i = 0 ; i < other.Count() ; i++)
Add(other.Dataset[i]);
mCalc();
return(*this);
} // assignment operator
//----------------------------------------------------------------------------
// return the last array item
complex<double> SStatArray::Last()
{
uint N = Count();
if(!N)
return(complex<double>(0,0)); // if empty, return 0
return(Dataset[N - 1]); // return the last Y value
}
//----------------------------------------------------------------------------
// add a new value (dependent variable y) to the array. This makes it possible
// to treat the array as an ordinary double variable, but one with a historical record.
// That is, SStatArray s = 24.5 adds 24.5 to the array.
// returns the exact value just added, so it can be used in compound expressions,
// same as assignment of any other numerical type.
double SStatArray::operator = (double d)
{
Add(d);
return(d); // exact passthrough, not (double)(*this), which could change
}
//----------------------------------------------------------------------------
// add a new data pair to the array as a complex. See above.
complex<double> SStatArray::operator = (complex<double> c)
{
Add(c);
return(c); // exact passthrough
}
//----------------------------------------------------------------------------
// allow use as a double.
// The value of the array is that of its most recently added Y (dependent) value.
// It could return alternative values: e.g. its mean, etc.,
// depending on the setting of a member flag.
// To use in some math operations, you must explicitly cast it to (double).
SStatArray::operator double ()
{
return(LastY()); // return the last Y value
}
//----------------------------------------------------------------------------
// allow use as a complex.
// removing this doesn't prevent compiler finding ambiguities in math operations.
SStatArray::operator complex<double> ()
{
return(Last()); // return the latest array item
}
//----------------------------------------------------------------------------
// output (text-formatted) stats to a stream
ostream& operator << (ostream& os, SStatArray& s)
{
s.mCalc();
uint i;
os << "Count " << s.Count() << endl;
os << "Low " << s.Low() << endl;
os << "High " << s.High() << endl;
os << "Range " << s.Range() << endl;
os << "Sum " << s.Sum(0) << endl;
os << "ArithMean " << s.Mean(0) << endl;
os << "GeoMean " << s.GeoMean() << endl;
os << "HarMean " << s.HarMean() << endl;
os << "Median " << s.Median() << endl;
os << "StDevPop " << s.StDevPop(0) << endl;
os << "VariancePop " << s.VariancePop(0) << endl;
os << "StDevEst " << s.StDevEst(0) << endl;
os << "VarianceEst " << s.VarianceEst(0) << endl;
os << "Mode Freq. " << (complex<uint>(s.ModeFreqX(),s.ModeFreqY())) << endl;
os << "XMode(s) ";
for(i = 0 ; i < s.ModesCountX() ; i++)
{
if(i != 0)
os << ", ";
os << s.ModeX(i);
}
os << endl;
os << "YMode(s) ";
for(i = 0 ; i < s.ModesCountY() ; i++)
{
if(i != 0)
os << ", ";
os << s.ModeY(i);
}
os << endl;
os << endl;
os << setiosflags(ios::left);
os << setw(36) << "Curve (+Best Fit) ";
os << setw(14) << "YInt(b)";
os << setw(14) << "Slope(m)";
os << setw(14) << "Beta(m)";
os << setw(14) << "R2";
os << setw(14) << "StdErrEst";
os << setw(14) << "T(of R2)";
os << setw(14) << "T(Slope)";
os << setw(14) << "TRatioR2";
os << setw(14) << "TRatioSlope" << endl;
os << endl;
//os << "----------------------------------- -------------- -------------- --------------" << endl;
for(i = (s.OutputAllStats ? 0 : s.BestFit) ; i < 8 ; i++)
{
if(!s.Allow[i]) // don't output transforms that weren't used
continue;
os << ((i == s.BestFit) ? "+" : " ");
os << setw(36) << (tostring(i) + ": " + s.GetCurveText(i));
os << setw(14) << s.YIntercept(i);
os << setw(14) << s.Slope(i);
os << setw(14) << s.StdBetaWeight(i);
os << setw(14) << s.RSquared(i);
os << setw(14) << s.StdErrOfEstimateYPop(i);
os << setw(14) << s.TStatisticOfCorr(i);
os << setw(14) << s.TStatisticOfSlope(i);
os << setw(14) << s.TCorrSignifRatio(i);
os << setw(14) << s.TSlopeSignifRatio(i);
os << endl;
if(!s.OutputAllStats) // if only BestFit, quit after 1 pass
break;
}
return(os);
}
//----------------------------------------------------------------------------
// write array data only to a stream
ostream& SStatArray::WriteData(ostream& os)
{
for(uint i = 0 ; i < Count() ; i++)
os << real(Dataset[i]) << " " << imag(Dataset[i]) << endl;
return(os);
}
//----------------------------------------------------------------------------
// write array data only to a file
void SStatArray::WriteData(const string& filename)
{
ofstream outfile(filename.c_str());
WriteData(outfile);
}
//----------------------------------------------------------------------------
// load array data from a stream.
// can read either 1 or 2 column input. (any additional columns are ignored)
// any column can have 1 or 2 items. See rules under Add().
// you must read all the data in one call, to prevent merging data from
// multiple files with different formats.
istream& operator >> (istream& is, SStatArray& s)
{
s.Flush(); // disallow appending data
double x, y;
char buf[256];
is >> ws; // ensure first getline has something to read
while(is.getline(buf,sizeof(buf)))
{
istringstream str(buf);
str >> x >> ws;
if(!str.eof()) // if there is anything left, it is Y
{
str >> y;
s.Add(x,y);
}
else
s.Add(x); // this x is really the y variable
is >> ws;
}
return(is);
} // operator >>
//----------------------------------------------------------------------------
// only allows setting to legal levels,
// returns the level that will be used from now on.
uint SStatArray::SetConfidenceLevel(uint newlevel)
{
switch(newlevel)
{
case 900: dConfidenceLevel = 900; break;
case 950: dConfidenceLevel = 950; break;
case 975: dConfidenceLevel = 975; break;
case 990: dConfidenceLevel = 990; break;
case 995: dConfidenceLevel = 995; break;
case 999: dConfidenceLevel = 999; break;
}
changed = true; // all the t statistics must be recalculated
return dConfidenceLevel;
}
//----------------------------------------------------------------------------
// read array data from a file
void SStatArray::ReadData(const string& filename)
{
// 1-22-2010 Breaking this into 2 lines made it compatible with g++.
ifstream is(filename.c_str());
is >> *this;
}
//----------------------------------------------------------------------------
// tabulates given numbers into sum sumofsquares, etc.
void SStatArray::Summate(int i, double tx, double ty)
{
dSum[i] += complex<double>(tx,ty);
dSum2[i] += complex<double>(tx * tx, ty * ty);
dSumXY[i] += tx * ty;
}
//----------------------------------------------------------------------------
// calculate all statistics
// returns: 0 if calculations couldn't be done (no elements)
// 1 if OK
// (this function must not use any of the other calculation functions, because
// they all mCalc, and changed isn't false until THIS Calc() is done. (endless loop))
int SStatArray::Calc()
{
// reset all variables in any event
dModesX.clear(); // empty the modes arrays
dModesY.clear();
ResidualsY.clear();
dGeoMean = complex<double>(std::numeric_limits<double>::quiet_NaN(), std::numeric_limits<double>::quiet_NaN()); // these have no meaning yet
dHarMean = complex<double>(std::numeric_limits<double>::quiet_NaN(), std::numeric_limits<double>::quiet_NaN());
dHigh = dLow = dRange = dMedian = complex<double>(0,0);
dModeFreqX = dModeFreqY = 0;
uint i;
for(i = 0 ; i < 8 ; i++)
{
dSum[i] = dSum2[i] = dMean[i] = dStDevEst[i] = dStDevPop[i] = complex<double>(0,0);
dSumXY[i] = dSlope[i] = dYIntercept[i] = dCorr[i] = dStdBetaWeight[i] = dTStatisticOfCorr[i] = dTCorrSignifRatio[i] = 0.;
dStdErrOfEstimateYPop[i] = dStdErrOfEstimateYEst[i] = dTStatisticOfSlope[i] = dTSlopeSignifRatio[i] = 0.;
dStdErrorOfSlope[i] = numeric_limits<double>::max();
Allow[i] = true;
}
Allow[QUAD] = false; // can't figure out how to do it
AllowGHMeanX = AllowGHMeanY = true;
// now see if there is any data to operate on
uint N = Count();
if(!N)
return(0);
// High, Low, Sum, Range, Means
dHigh = complex<double>(-numeric_limits<double>::max(), -numeric_limits<double>::max());
dLow = complex<double>(numeric_limits<double>::max(), numeric_limits<double>::max());
for(i = 0 ; i < N ; i++)
{
double x = real(Dataset[i]);
double y = imag(Dataset[i]);
dHigh = complex<double>(max(real(dHigh),x), max(imag(dHigh),y));
dLow = complex<double>(min(real(dLow),x), min(imag(dLow),y));
if(x == 0) Allow[HYPER] = false;
if(y == 0) Allow[INVLINE] = false;
if(x <= 0) Allow[LOG] = Allow[POW] = AllowGHMeanX = false;
if(y <= 0) Allow[EXP] = Allow[POW] = AllowGHMeanY = false;
// each summation receives data until a data point makes it no longer allowable
if(Allow[LINE]) Summate(LINE, x, y); // no adjustment
if(Allow[LOG]) Summate(LOG, log(x), y); // TI59:SA:84
if(Allow[PARA]) Summate(PARA, x * x, y);
if(Allow[HYPER]) Summate(HYPER, 1 / x, y);
if(Allow[INVLINE]) Summate(INVLINE, x, 1 / y);
if(Allow[EXP]) Summate(EXP, x, log(y)); // TI59:PP:v39
if(Allow[POW]) Summate(POW, log(x), log(y)); // TI59:SA:84
if(Allow[QUAD]) Summate(QUAD, x, y); // no idea; y = Ax*x + Bx + C?
}
dRange = dHigh - dLow;
for(i = 0 ; i < 8 ; i++)
dMean[i] = dSum[i] / (double)N;
// GEOMETRIC AND HARMONIC MEANS FOR X
if(AllowGHMeanX)
{
double d;
double g = 1;
double h = 0;
for(i = 0 ; i < N ; i++)
{
d = real(Dataset[i]);
g *= pow(d, 1 / (double)N);
h += (1 / d);
}
dGeoMean = complex<double>(g, imag(dGeoMean));
dHarMean = complex<double>(N / h, imag(dHarMean));
}
// GEOMETRIC AND HARMONIC MEANS FOR Y
if(AllowGHMeanY)
{
double d;
double g = 1;
double h = 0;
for(i = 0 ; i < N ; i++)
{
d = imag(Dataset[i]);
g *= pow(d, 1 / (double)N);
h += (1 / d);
}
dGeoMean = complex<double>(real(dGeoMean), g);
dHarMean = complex<double>(real(dHarMean), N / h);
}
//---------- MEDIAN, MODE
int center = N / 2;
vector<double> s; // a separate array that we can sort
//---------- MEDIAN X: FIND HALFWAY POINT IN X PORTION
for(i = 0 ; i < N ; i++)
s.push_back(real(Dataset[i]));
sort(s.begin(),s.end()); // default sort is ascending
// if even # of elements, interpolate to get a "center" point
// (if N is 1 (and thus center is 0), s[center-1] below won't be calculated)
double mx = ((N % 2) == 1) ? s[center] : (s[center - 1] + s[center]) / 2;
//---------- MODE X. Allows for multimodal data sets.
dModeFreqX = 0; // count starts at 0
double thisnum = s[0]; // each new number encountered, initialized to the first element
uint tally = 1; // temporary counter for numbers encountered; first element has occurred once so far
for(i = 1 ; i < N ; i++)
{
if(s[i] == thisnum) // if it's another occurrence,
tally++; // just increment the counter
else // else if we hit a new #
{ // first decide if the old number is a mode candidate.
if(tally == dModeFreqX) // if tally is a tie, add number to the modes list
dModesX.push_back(thisnum);
if(tally > dModeFreqX) // if there is a new higher frequency,
{
dModesX.clear(); // delete all previous mode candidates
dModesX.push_back(thisnum); // add this one to the list
dModeFreqX = tally; // and update the highest count counter
}
thisnum = s[i]; // now start tallying the new number
tally = 1; // it has already occurred once
}
}
if(tally == dModeFreqX) // final check: maybe the last # was also a potential mode
dModesX.push_back(thisnum);
if(tally > dModeFreqX)
{
dModesX.clear(); // delete all previous mode candidates
dModesX.push_back(thisnum); // add this one to the list
dModeFreqX = tally; // and update the highest count counter
}
// If there was only 1 input value, it's ok to let it be the mode,
// but if there were multiple input values, minimum frequency for the mode is 2.
if(s.size() > 1)
if(dModeFreqX < 2)
{
dModesX.clear(); // No legitimate mode found.
dModeFreqX = 0; // No occurrences.
}
//---------- MEDIAN Y: FIND HALFWAY POINT IN Y PORTION
s.clear();
for(i = 0 ; i < N ; i++)
s.push_back(imag(Dataset[i]));
sort(s.begin(),s.end());
double my = ((N % 2) == 1) ? s[center] : (s[center - 1] + s[center]) / 2;
dMedian = complex<double>(mx,my);
//---------- MODE Y
dModeFreqY = 0; // count starts at 0
thisnum = s[0]; // each new number encountered, initialized to the first element
tally = 1; // temporary counter for numbers encountered; first element has occurred once so far
for(i = 1 ; i < N ; i++)
{
if(s[i] == thisnum) // if it's another occurrence,
tally++; // just increment the counter
else // else if we hit a new #
{ // first decide if the old number is a mode candidate.
if(tally == dModeFreqY) // if tally is a tie, add number to the modes list
dModesY.push_back(thisnum);
if(tally > dModeFreqY) // if there is a new higher frequency,
{
dModesY.clear(); // delete all previous mode candidates
dModesY.push_back(thisnum); // add this one to the list
dModeFreqY = tally; // and update the highest count counter
}
thisnum = s[i]; // now start tallying the new number
tally = 1; // it has already occurred once
}
}
if(tally == dModeFreqY) // final check: maybe the last # was also a potential mode
dModesY.push_back(thisnum);
if(tally > dModeFreqY)
{
dModesY.clear(); // delete all previous mode candidates
dModesY.push_back(thisnum); // add this one to the list
dModeFreqY = tally; // and update the highest count counter
}
if(s.size() > 1)
if(dModeFreqY < 2)
{
dModesY.clear(); // No legitimate mode found.
dModeFreqY = 0; // No occurrences.
}
s.clear(); // frees space if it's large
// STANDARD DEVIATIONS AND REGRESSIONS
for(i = 0 ; i < 8 ; i++)
{
if(!Allow[i]) // is this transformation legal?
continue;
// StDevEst, StDevPop
// these eventual operands of sqrt will never evaluate to negative, but can be zero...
double b = (N * real(dSum2[i])) - (real(dSum[i]) * real(dSum[i]));
double c = (N * imag(dSum2[i])) - (imag(dSum[i]) * imag(dSum[i]));
// however, due to floating point rounding errors, when they should be 0, they can be tiny negative fractions
if(b < 0) b = 0;
if(c < 0) c = 0;
b = sqrt(b);
c = sqrt(c);
// Pop calculation is always valid. If N==1, Pop and Est are both 0. If N>1, value of Est gets overwritten in the next line.
dStDevEst[i] = dStDevPop[i] = complex<double>(b/N, c/N);
if(N > 1)
dStDevEst[i] = complex<double>(b/sqrt((double)N * (N - 1)), c/sqrt((double)N * (N - 1)));
// LINEAR REGRESSION
// look for a version that minimizes chances of overflow, divide by zero, rounding errors,
// and works best for the multicurve fitting.
// this set of calculation methods is from TI59 book (use of dMean in intercept calc is from Excel)
// it correctly calculates YInt for 10,20,30 as 0.0
double a = dSumXY[i] - (real(dSum[i]) * imag(dSum[i]) / N);
// b and c are already guaranteed non-negative; here they must not be 0, either.
if(b > 0.)
{
if(a == 0)
dSlope[i] = 0;
else
dSlope[i] = a / (real(dSum2[i]) - (real(dSum[i]) * real(dSum[i]) / N));
dYIntercept[i] = imag(dMean[i]) - (dSlope[i] * real(dMean[i]));
}
//for (c==0) horizontal line, dCorr[i] is left at 0, which is correct for the
//correlation between X and Y, but it may not mean that all the points fall
//exactly ON the line (?); can they cloud around it? What is the measure of that?
//probably just the mean, stdev, etc.
//Multiple regression seems to use R-Squared, which IS r squared.
//For our purpose here, r is probably more correct, although operator << displays R2 instead.
//R2, the coefficient of determination magnifies the distance from 1.0,
//and is the proportion of variance in Y that is attributable to changes in X.
//It appears that the unbiased estimates of the standard deviations are the correct ones to use here,
//but you can use either. The N or N-1 factors all cancel out anyway.
if(c > 0.)
{
if(imag(dStDevEst[i]) == 0)
dCorr[i] = 0;
else
dCorr[i] = dSlope[i] * real(dStDevEst[i]) / imag(dStDevEst[i]);
}
// this alternative set of calculation methods was probably from stats books,
// and probably carefully checked, but check again.
// it calculates YInt for 10,20,30 as ...e-14, and uses StDev a lot. (est vs. pop?)
// if((b > 0.) && (c > 0.))
// {
// double a = (N * dSumXY[0]) - (real(dSum[0]) * imag(dSum[0]));
// dCorr[0] = a / (b * c);
// dSlope[0] = dCorr[0] * imag(dStDevPop[0]) / real(dStDevPop[0]);
// // why is this formula so much different from PredictY (and so complicated?)
// dYIntercept[0] = (-dCorr[0] * imag(dStDevPop[0]) * real(dMean[0]) / real(dStDevPop[0])) + imag(dMean[0]);
// }
// STANDARDIZED BETA WEIGHT OF PREDICTOR X AT PREDICTING Y FROM X.
// The ==0 test prevents divide by zero. Using weight of 0 is ok: if there is no variance in Y (the dependent), there is no variance to account for.
// I've tested the result against the direct method of converting raw scores to z-scores before calculating the slope (weight), and they match
// with no differences other than rounding errors.
dStdBetaWeight[i] = (imag(dStDevEst[i]) == 0 ? 0 : dSlope[i] * real(dStDevEst[i]) / imag(dStDevEst[i]));
// STANDARD ERROR OF ESTIMATE OF Y ON X (population value), Minium p.180, uses N as denominator.
dStdErrOfEstimateYPop[i] = imag(dStDevPop[i]) * sqrt(1.0 - (dCorr[i] * dCorr[i]));
dStdErrOfEstimateYEst[i] = dStdErrOfEstimateYPop[i]; // ensures a valid figure even if N < 3
// Student's t to evaluate significance of the SLOPE
// (see notes in code block for Student's t on the correlation)
if(N > 2)
{
// UNBIASED STANDARD ERROR OF ESTIMATE OF Y ON X
//dStdErrOfEstimateYEst[i] = imag(dStDevEst[i]) * sqrt(((N - 1) / (N - 2)) * (1.0 - (dCorr[i] * dCorr[i]))); // originally
//dStdErrOfEstimateYEst[i] = imag(dStDevPop[i]) * sqrt((N / (N - 2)) * (1.0 - (dCorr[i] * dCorr[i]))); // using Pop as the basis
dStdErrOfEstimateYEst[i] = dStdErrOfEstimateYPop[i] * sqrt((double)N / (N - 2)); // so this should be equivalent, using Pop
// STUDENT'S T TO EVALUATE SIGNIFICANCE OF dSlope (NOT THE STANDARDIZED BETA WEIGHT)
// t = slope / standard error of slope
// #error standard error of slope should be made a member variable because it is useful:
// With the confidence level you used, you know that the slope is between (slope - stdError) and (slope + stdError)
// method in next line is from http://www.tc3.edu/instruct/sbrown/stat/infregr.htm which has a useful example walkthrough.
//dStdErrorOfSlope[i] = dStdErrOfEstimateYEst[i] / sqrt(real(dSum2[i]) - ((real(dSum[i]) * real(dSum[i])) / N));
dStdErrorOfSlope[i] = dStdErrOfEstimateYEst[i] / (real(dStDevPop[i]) * sqrt((double)N)); // this is equivalent
dTStatisticOfSlope[i] = fabs(dSlope[i] / dStdErrorOfSlope[i]); // fabs prevents negative t
// evaluate the significance of t
uint df = N - 2; // degrees of freedom for this t-test
if(df > 100)
df = 101; // above 100, it's standard normal curve, and all use the same value
df--; // convert it to an array index
double critvalue = numeric_limits<double>::max(); // failsafe large nonzero value if no cases below apply
switch(dConfidenceLevel)
{
case 900: critvalue = TCrit900[df]; break;
case 950: critvalue = TCrit950[df]; break;
case 975: critvalue = TCrit975[df]; break;
case 990: critvalue = TCrit990[df]; break;
case 995: critvalue = TCrit995[df]; break;
case 999: critvalue = TCrit999[df]; break;
}
dTSlopeSignifRatio[i] = dTStatisticOfSlope[i] / critvalue;
}
else
{
dTStatisticOfSlope[i] = 0; // redundancy (it was set to 0 earlier)
dTSlopeSignifRatio[i] = 0;
}
// Student's t to evaluate significance of the CORRELATION COEFFICIENT
// Method: Minium p.319 and TI Applied Statistics p.5-7; df for this t test is (N-2), minimum legal df = 1
if(N > 2)
{
//dTStatisticOfCorr[i] = dCorr[i] / sqrt((1.0 - (dCorr[i] * dCorr[i])) / (N - 2)); // Minium p.319 allows negative t
double d2 = dCorr[i] * dCorr[i];
if(d2 == 1.0) // if dCorr is 1.0, t is infinite
dTStatisticOfCorr[i] = numeric_limits<double>::max(); // prevent divide by zero
else
dTStatisticOfCorr[i] = sqrt(((N - 2) * d2) / (1.0 - d2)); // TI59 equivalent formula is always positive
// evaluate the significance of t
uint df = N - 2; // degrees of freedom for this t-test
if(df > 100)
df = 101; // above 100, it's standard normal curve, and all use the same value
df--; // convert it to an array index
double critvalue = numeric_limits<double>::max(); // failsafe large nonzero value if no cases below apply
switch(dConfidenceLevel)
{
case 900: critvalue = TCrit900[df]; break;
case 950: critvalue = TCrit950[df]; break;
case 975: critvalue = TCrit975[df]; break;
case 990: critvalue = TCrit990[df]; break;
case 995: critvalue = TCrit995[df]; break;
case 999: critvalue = TCrit999[df]; break;
}
dTCorrSignifRatio[i] = dTStatisticOfCorr[i] / critvalue;
}
else
{
dTStatisticOfCorr[i] = 0; // redundancy (it was set to 0 earlier)
dTCorrSignifRatio[i] = 0;
}
} // end for(i = 0 ; i < 8 ; i++)
// determine best fit (correlation nearest to 1 or -1)
// #error now that statistical significance tests are available, raw RSquared might not be the best measure of best fit
// revised 1/24/08 to use R2.
BestFit = 0;
double compare = 0;
double squared;
for(i = 0 ; i < 8 ; i++)
if(Allow[i])
{
squared = dCorr[i] * dCorr[i];
if(squared > compare)
{
compare = squared;
BestFit = i;
}
}
changed = false;
return(1);
} // Calc
//----------------------------------------------------------------------------
// Correlation coefficient
double SStatArray::Corr(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
return(dCorr[i]);
} // Corr
//----------------------------------------------------------------------------
// coefficient of determination, percentage of Y variance attributable to changes in X
double SStatArray::RSquared(int i)
{
double d = Corr(i); // Corr() does mCalc()
if(d == std::numeric_limits<double>::quiet_NaN())
return d;
else
return d * d;
}
//----------------------------------------------------------------------------
// Student's t statistic appropriate for assessing significance of Corr().
// We are testing the hypothesis that the true population correlation between x and y IS 0.
// If t exceeds the critical value, we reject that hypothesis and conclude that there IS correlation between X and Y.
// For this purpose, degrees of freedom = (number of data pairs - 2).
double SStatArray::TStatisticOfCorr(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i] || (Count() < 3)) // df for this t test is (N-2), minimum legal df = 1
return(0);
return dTStatisticOfCorr[i];
}
//----------------------------------------------------------------------------
// Ratio of t to the critical value of t for current df
double SStatArray::TCorrSignifRatio(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i] || (Count() < 3)) // df for t test is (N-2), minimum legal df = 1
return(0);
return dTCorrSignifRatio[i];
}
//----------------------------------------------------------------------------
// Students t for assessing whether slope is significantly different from zero
double SStatArray::TStatisticOfSlope(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i] || (Count() < 3)) // unsure if 3 is the correct minimum value here
return(0);
return dTStatisticOfSlope[i];
}
//----------------------------------------------------------------------------
// Ratio of t to the critical value of t for current df
double SStatArray::TSlopeSignifRatio(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i] || (Count() < 3)) // unsure if 3 is the correct minimum value here
return(0);
return dTSlopeSignifRatio[i];
}
//----------------------------------------------------------------------------
// Standard Error of the Slope (determines upper and lower limits of true slope, within confidence)
double SStatArray::StdErrorOfSlope(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i] || (Count() < 3)) // unsure if 3 is the correct minimum value here
return(numeric_limits<double>::max());
return dStdErrorOfSlope[i];
}
//----------------------------------------------------------------------------
// see Texas Instruments TI59 Programmable Calculator, Statistical Analysis module, p.83 for description of curve types
double SStatArray::Slope(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
double m = dSlope[i];
//switch(i)
//{
// case LINE : break;
// case LOG : break;
// case PARA : break;
// case HYPER : break;
// case INVLINE: break;
// case EXP : break;
// case POW : break;
// case QUAD : break;
//}
return(m);
} // Slope
//----------------------------------------------------------------------------
// returns the de-transformed YIntercept value: this is the value you need
// to know to use the exponentiated version of the equation.
// (as opposed to the transposed-to-linear version, which the program MUST use internally)
double SStatArray::YIntercept(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
double b = dYIntercept[i];
switch(i) // adjust if entry changes were made to Y (because b is in Y axis units?)
{
case LINE : break;
case LOG : break; // TI59:SA:84
case PARA : break; // no adj: tested against Excel graph
case HYPER : break; // no adj: tested against Excel graph
case INVLINE: break; // no adj: m,b describe an x-axis point for INVLINE
case EXP : b = exp(b); break; // TI59:SA:83
case POW : b = exp(b); break; // TI59:SA:84
case QUAD : break; // no idea
}
return(b);
} // YIntercept
//----------------------------------------------------------------------------
// multiple regression version of Slope(), where X and Y arrays are z-scores; YIntercept is always 0.
// used for determining which predictor of many has the largest influence on Y, after converting all to normalized z-scores.
double SStatArray::StdBetaWeight(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
double m = dStdBetaWeight[i];
// I don't think any of these need adjustment, same as Slope()
//switch(i)
//{
// case LINE : break;
// case LOG : break;
// case PARA : break;
// case HYPER : break;
// case INVLINE: break;
// case EXP : break;
// case POW : break;
// case QUAD : break;
//}
return(m);
}
//----------------------------------------------------------------------------
double SStatArray::XIntercept(int i)
{
// mCalc(); // PredictX does all this
// if(i == -1)
// i = BestFit;
double d = PredictX(0, i); // PredictX checks for !Allow[i]
return(d);
}
//----------------------------------------------------------------------------
//1/17/2009 Although this is the way it is often informally done, PredictX is technically wrong.
//There are two regression equations and two regression lines,
//Y on X and X on Y, which respectively minimize the least squares differences in either Y or X,
//(minimizing the deviations from the regression line in either the vertical or horizontal direction).
//For predicting X from Y, you'd need to calculate the other regression line, which isn't currently done.
//To implement, these member variables would need to be made complex<double>:
//For their calculations, you just exchange X values with Y values in the regression equations.
//double dSlope[8];
//double dYIntercept[8];
//double dCorr[8];
double SStatArray::PredictX(double y, int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
// even if the transformation is legal based on the data IN the array,
// it may be illegal to transform the new Y value provided:
if((y == 0) && (i == INVLINE)) return(std::numeric_limits<double>::quiet_NaN());
if((y <= 0) && ((i == EXP) || (i == POW))) return(std::numeric_limits<double>::quiet_NaN());
// transform y before using it, in the same manner it was transformed during data entry
switch(i)
{
case LINE : break;
case LOG : break; // TI59:SA:84
case PARA : break; // a guess
case HYPER : break; // a guess
case INVLINE: y = 1 / y; break; // a guess
case EXP : y = log(y); break; // TI59:PP:v39
case POW : y = log(y); break; // TI59:SA:84
case QUAD : break; // no idea
}
double x = (y - dYIntercept[i]) / dSlope[i];
// x may be valid, but its upcoming transformation may cause a math error, so:
if((x == 0) && (i == HYPER)) return(std::numeric_limits<double>::quiet_NaN());
if((x < 0) && (i == PARA)) return(std::numeric_limits<double>::quiet_NaN());
// re-transform result to raw data (if you transformed X upon data entry?)
switch(i)
{
case LINE : break;
case LOG : x = exp(x); break; // TI59:SA:84
case PARA : x = sqrt(x); break; // a guess
case HYPER : x = 1 / x; break; // a guess
case INVLINE: break; // a guess
case EXP : break; // no adj, TI59:PP:v39
case POW : x = exp(x); break; // TI59:SA:84
case QUAD : x = x; break; // no idea, probably does need adjusting
}
return(x);
} // PredictX
//----------------------------------------------------------------------------
double SStatArray::PredictY(double x, int i)
{
mCalc();
if(i == -1)
i = BestFit;
if(!Allow[i])
return(std::numeric_limits<double>::quiet_NaN());
// even if the transformation is legal based on the data IN the array,
// it may be illegal to transform the new X value provided:
if((x == 0) && (i == HYPER)) return(std::numeric_limits<double>::quiet_NaN());
if((x <= 0) && ((i == LOG) || (i == POW))) return(std::numeric_limits<double>::quiet_NaN());
// transform x before using it in the same manner x was transformed upon data entry
switch(i)
{
case LINE : break;
case LOG : x = log(x); break; // TI59:SA:84
case PARA : x = x * x; break; // a guess
case HYPER : x = 1 / x; break; // a guess
case INVLINE: break; // a guess
case EXP : break; // no adj, TI59:PP:v39
case POW : x = log(x); break; // TI59:SA:84
case QUAD : x = x; break; // no idea, probably does need adjusting
}
double y = dSlope[i] * x + dYIntercept[i];
// y may be valid, but its upcoming transformation may cause a math error, so:
if((y == 0) && (i == INVLINE)) return(std::numeric_limits<double>::quiet_NaN());
// un-transform the result (with the inverse operation done on Y upon data entry)
switch(i)
{
case LINE : break;
case LOG : break; // TI59:SA:84
case PARA : break; // a guess
case HYPER : break; // a guess
case INVLINE: y = 1 / y; break; // a guess
case EXP : y = exp(y); break; // TI59:PP:v39
case POW : y = exp(y); break; // TI59:SA:84
case QUAD : break; // no idea
}
return(y);
} // PredictY
//----------------------------------------------------------------------------
// provide text description of best fit (or any requested) curve
string SStatArray::GetCurveText(int i)
{
mCalc();
if(i == -1)
i = BestFit;
switch(i)
{
case LINE : return("Linear y = b + mx");
case LOG : return("Logarithmic y = b + mLN(x)");
case PARA : return("Parabolic y = b + mxx");
case HYPER : return("Hyperbolic y = b + m/x");
case INVLINE: return("Inverse Linear y = 1 / (b + mx)");
case EXP : return("Exponential y = be^(mx)");
case POW : return("Power y = bx^m");
case QUAD : return("Quadratic");
}
return("Illegal Curve Number");
}
//----------------------------------------------------------------------------
// return prediction of the next point (x,y) in the sequence:
// next x = lastX + XRange / (N-1) (assumes equal intervals between X values)
// next y = PredictY(x)
// main intended use is for Adapt.cpp (my artificial life evolution program) or similar agents.
// Note this only returns 1 prediction; repeated calls don't return successive predicted values.
complex<double> SStatArray::Next(int i) // you can specify which model to use
{
int N = Count();
if(N < 2) // to predict, you need 2 points, & prevent div by 0 below
return(complex<double>(std::numeric_limits<double>::quiet_NaN(),std::numeric_limits<double>::quiet_NaN()));
mCalc();
if(i == -1)
i = BestFit;
// could use dRange[BestFit], then untransform it (?? but dRange is not an array)
double nextx = real(Dataset[N - 1]) + (real(dRange) / (N - 1));
double nexty = PredictY(nextx, i);
return(complex<double>(nextx,nexty));
} // Next
//----------------------------------------------------------------------------
// Population (biased, N-weighted) Standard Error of Estimate of Y predicted from X
double SStatArray::StdErrOfEstimateYPop(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if((Count() < 2) || !Allow[i]) // minimum 2 points to calculate a trend line
return(std::numeric_limits<double>::quiet_NaN());
return dStdErrOfEstimateYPop[i];
}
//----------------------------------------------------------------------------
// Population (biased, N-weighted) Standard Error of Estimate of Y predicted from X
double SStatArray::StdErrOfEstimateYEst(int i)
{
mCalc();
if(i == -1)
i = BestFit;
if((Count() < 3) || !Allow[i]) // minimum 3 points for unbiased estimator
return(std::numeric_limits<double>::quiet_NaN());
return dStdErrOfEstimateYEst[i];
}
//----------------------------------------------------------------------------
// calculate Y residuals, store in ResidualsY, and
// return unbiased Standard Error of Estimate of Y predicted from X (Estimated value, N-2 weighting)
double SStatArray::StandardErrorY(int i)
{
mCalc();
ResidualsY.clear();
if(i == -1)
i = BestFit;
uint N = Count();
if((N < 3) || !Allow[i]) // minimum 3 points when using N-2 weighting
return(std::numeric_limits<double>::quiet_NaN());
double residue = 0;
double d, e;
for(uint j = 0 ; j < N ; j++)
{
d = PredictY(real(Dataset[j]), i);
e = imag(Dataset[j]) - d;
residue += (e * e);
ResidualsY.push_back(e);
}
// for comparison, from Minium p.180: (this is the equivalent of using (N) below instead of (N-2)).
// dStdErrOfEstimateYPop[i] = imag(dStDevPop[i]) * sqrt(1.0 - (dCorr[i] * dCorr[i]));
// note this is Population standard deviation of the actual data point deviations,
// which matches book examples and the dStdErrorOfEstimate calc in Calc(), but it still seems N-1 would be more appropriate.
// 2-22-2009: see below; (N-2) is apparently the accepted standard
//return sqrt(residue / N); // from Minium p.177, possibly simplified for teaching purposes?
return sqrt(residue / (N - 2)); // from websites dealing with practical calculation
}
//----------------------------------------------------------------------------
//find the value of Y most commonly paired with a particular X (when lowx == highx),
//or with a particular range of X values.
//The expected use for this is with time series, or where the values in your
//arrays are actually codes corresponding to consecutive occurrences or states,
//and where you want to know, for example, "When state X occurred, what was the
//state Y that most often followed it?" Once you have the answer, this is a
//situation where you could use a Chi-squared test to determine if the relationship
//is actually significant and you should pay attention to it. (Although in most
//cases, even if it isn't, it still has to be your best guess.)
//
//returns a complex: mode is in real(), its frequency in imag()
//(unlike the others to allow calling the function only once, because it can't just
//return a data member)
//
//Leave this in! You can use the copy constructor to create a subset, but THAT
//subset doesn't contain the information about the other Y values that exist
//(they were excluded), so you can't get expected frequencies for Chi-squared.
//That is, to calc chi squared, you must know how many Y values there are,
//so you can assign equal expected probability to each.
//add an option allowing caller to specify which mode to return in case of conflict: lowest, highest, other?
complex<double> SStatArray::ModeYForXBetween(double lowx, double highx)
{
uint N = Count();
if(!N)
return(complex<double>(std::numeric_limits<double>::quiet_NaN(),std::numeric_limits<double>::quiet_NaN()));
if(lowx > highx)
swap(lowx,highx);
uint i;
vector<double> s; // sorted array automatically orders the members
for(i = 0 ; i < N ; i++)
{
double x = real(Dataset[i]);
if((x >= lowx) && (x <= highx)) // test INCLUDES both endpoints
s.push_back(imag(Dataset[i]));
}
sort(s.begin(), s.end());
N = s.size();
if(!N) // none met the criteria
return(complex<double>(std::numeric_limits<double>::quiet_NaN(),std::numeric_limits<double>::quiet_NaN()));
// 1-19-2008 This uses the old method, which can calculate only 1 unique mode.
// don't bother revising this until I need the function for something.
double modey = s[0]; // start with first as the most common
int ty = 1; // and it occurs once
double thisnum = s[0]; // each new number encountered
int tally = 1; // temporary counter for numbers encountered
for(i = 1 ; i < N ; i++)
{
if(s[i] == thisnum) // if it's another occurrence,
tally++; // just increment the counter
else // else if we hit a new #
{
if(tally > ty) // first check if the previous one should be the new mode
{
modey = thisnum; // thisnum is now the previous one
ty = tally; // and update the highest count counter
}
thisnum = s[i]; // now start tallying the new number
tally = 1; // it has already occurred once
}
}
if(tally > ty) // final check: maybe the last # was the most common
{
modey = thisnum;
ty = tally;
}
return(complex<double>(modey,ty));
} // ModeYForXBetween
//----------------------------------------------------------------------------
// Convert 1-dimensional data to a time series, such that for each point
// Y is moved into X, and Y becomes the value that followed it in the series.
// (1,10)(2,20)(3,30) becomes (10,20)(20,30). (result set has 1 fewer points)
// No idea what this might be used for, but might be a handy way to get such
// data pairs prepared for transfer to Excel, or be useful if used in a program
// where plotting capabilities are available.
// Like with the ChiSq application, this uses the array differently from normal.
// If you subsequently use the statistical functions, be sure you know what you're doing.
// (since the data was moved around)
// (untested)
bool SStatArray::ConvertToTimeSeries()
{
uint N = Count();
if(N < 2) // minimum 2 pairs to have 1 resulting pair
return(false);
uint last = N - 1;
for(uint i = 0 ; i < last ; i++)
Dataset[i] = complex<double>(imag(Dataset[i]),real(Dataset[i + 1]));
Dataset.erase(Dataset.end()); // last point has no next (and set changed to true)
changed = true;
return(true);
} // ConvertToTimeSeries
//----------------------------------------------------------------------------
// Does the best it can to reconvert a time series back to normal form.
// Uses 1,2,3... for the X values. (10,20)(20,30) becomes (1,10)(2,20)(3,30)
// (untested)
bool SStatArray::ConvertFromTimeSeries()
{
uint N = Count();
if(!N) // no data
return(false);
double lasty;
for(uint i = 0 ; i < N ; i++)
{
lasty = imag(Dataset[i]);
Dataset[i] = complex<double>(i + 1, real(Dataset[i]));
}
Add(lasty); // restore a point for the final value (and will set changed to true)
return(true);
} // ConvertFromTimeSeries
//----------------------------------------------------------------------------
// Convert all entries in X and/or Y arrays to standard z-scores.
// returns true if successful, false if the operation could not be performed.
// Because the original means and standard deviations are not saved,
// there is no way to convert the arrays back to raw scores.
// In addition, the z-scores will have zero and negative values, making the
// geometric and harmonic means incalculable and some of the regression
// data transformations disallowed. Thus, in many cases it is preferable
// to use separate SStatArrays for the converted and unconverted scores.
// The productive web search to find the method for this was: how to standardize regression coefficients.
bool SStatArray::ConvertToZScores(bool convertx, bool converty)
{
if(!convertx && !converty) // no conversion requested
return(false);
uint N = Count();
if(N < 2) // not enough data
return(false);
double stdevestX = StDevEstX(0); // save while these are still valid for the old data
double stdevestY = StDevEstY(0);
double meanX = MeanX(0);
double meanY = MeanY(0);
double x, y;
for(uint i = 0 ; i < N ; i++)
{
x = real(Dataset[i]);
y = imag(Dataset[i]);
if(convertx)
{
// Must prevent divide by zero.
// If the variance is zero, it means all values are the same, and thus all have z-score of 0.
if(stdevestX == 0)
x = 0;
else
x = (x - meanX) / stdevestX;
}
if(converty)
{
if(stdevestY == 0)
y = 0;
else
y = (y - meanY) / stdevestY;
}
Dataset[i] = complex<double>(x,y);
}
changed = true;
return(true);
}
//----------------------------------------------------------------------------
// Chi squared calculations use the array differently from the others.
// Array is expected to contain expected frequencies in X (real), observed frequencies in Y (imag)
// returns the chi squared values, real = without Yates correction, imag = with Yates.
// to be valid, all expected values should be >= 5, and MUST be > 0.
complex<double> SStatArray::ChiSq()
{
int N = Count();
complex<double> chisq = complex<double>(0,0);
for(int i = 0 ; i < N ; i++)
{
if(real(Dataset[i]) == 0) // prevent divide by 0
continue;
double noyates = imag(Dataset[i]) - real(Dataset[i]);
double yates = fabs(imag(Dataset[i]) - real(Dataset[i])) - .5;
chisq += complex<double>((noyates * noyates) / real(Dataset[i]), (yates * yates) / real(Dataset[i]));
}
return(chisq);
} // ChiSq
//----------------------------------------------------------------------------
// performs a Chi squared test on the array.
// as noted under ModeYForXBetween, even if your best guess isn't significant,
// you may still have to use it. But if you are tracking relationships and memory
// is close to full, you could use a Chi squared test to choose which relationships
// to stop tracking, and delete.
//
// returns true if array is NOT normally distributed (IS significantly different from expected)
// false if array matches expected frequencies (is NOT significantly different)
bool SStatArray::ChiSqIsSignif(bool useYates)
{
int N = Count();
if(N < 2) // minimum 2 pairs so V >= 1. V=0 is illegal.
return(false);
double chisq = useYates ? imag(ChiSq()) : real(ChiSq());
double criticalvalue = GetC5();
return(chisq > criticalvalue);
} // ChiSqIsSignif
//----------------------------------------------------------------------------
// returns the 5% critical value for the # of items in the array.
double SStatArray::GetC5()
{
int N = Count();
if(N < 2) // minimum 2 pairs so V >= 1. V=0 is illegal.
return(std::numeric_limits<double>::quiet_NaN()); // an obvious error condition value
double criticalvalue;
double dF = N - 1; // degrees of freedom
if(dF <= 30)
criticalvalue = C5[(uint)dF - 1]; // dFs start at 1, array indices start at 0
else
if(dF > 100)
criticalvalue = .5 * (1.6449 + sqrt(2 * dF - 1)) * (1.6449 + sqrt(2 * dF - 1));
else
if(dF == 100)
criticalvalue = 124.3421013;
else // > 30 and < 100
criticalvalue = dF * pow(1 - 2/(9 * dF) + 1.6449 * sqrt(2/(9 * dF)),3);
return(criticalvalue);
} // GetC5
//----------------------------------------------------------------------------
// returns the 10% critical value for the # of items in the array.
double SStatArray::GetC10()
{
int N = Count();
if(N < 2) // minimum 2 pairs so V >= 1. V=0 is illegal.
return(std::numeric_limits<double>::quiet_NaN()); // an obvious error condition value
double criticalvalue;
int dF = N - 1; // degrees of freedom
if(dF <= 30)
criticalvalue = C10[dF - 1]; // dFs start at 1, array indices start at 0
else
if(dF > 100)
criticalvalue = numeric_limits<double>::max(); // error: calculation unknown
else
if(dF == 100)
criticalvalue = 118.4980018;
else // > 30 and < 100
criticalvalue = numeric_limits<double>::max(); // error: calculation unknown
return(criticalvalue);
} // GetC10
//----------------------------------------------------------------------------
// Array functions
//----------------------------------------------------------------------------
void SStatArray::Add(complex<double> c)
{
changed = true;
Dataset.push_back(c);
}
//----------------------------------------------------------------------------
// allow providing x and y as separate doubles
void SStatArray::Add(double x, double y)
{
Add(complex<double>(x,y));
}
//----------------------------------------------------------------------------
// if you only provide one value, it is the dependent variable, stored in y.
// x is an automatic counter, starting at 1 (not 0) to allow multicurve fitting.
// (0 is an illegal value in some of the models)
void SStatArray::Add(double y)
{
double x;
uint N = Count();
// if x is regularly spaced, you need only enter enough x values to establish the interval
// you can change the interval at any point by entering enough x values to establish it
switch(N)
{
case 0: // no info available, start at 1
x = 1;
break;
case 1: // no interval, assume + 1
x = real(Dataset[0]) + 1;
break;
default: // use previous interval
x = real(Dataset[N - 1]) + (real(Dataset[N - 1]) - real(Dataset[N - 2]));
break;
}
Add(complex<double>(x,y));
}
//----------------------------------------------------------------------------
// adds each char, to s.length() (including null chars)
// allows calculating stats on a string. will be used for wtalk, bongard string descriptions
int SStatArray::Add(const string& s)
{
for(uint i = 0 ; i < s.length() ; i++)
Add((double)s[i]);
return(1); // ?
}
//----------------------------------------------------------------------------
// adds each char up to first null
int SStatArray::Add(const char* s)
{
if(!s)
return(0);
for( ; *s ; s++)
Add((double)(*s));
return(1); // ?
}
//----------------------------------------------------------------------------
// unsure whether it's desirable to allow this
complex<double>& SStatArray::operator [](uint loc)
{
changed = true; // because you might change it
return(Dataset[loc]);
}
//----------------------------------------------------------------------------
void SStatArray::Flush()
{
changed = true;
Dataset.clear();
ResidualsY.clear();
}
//----------------------------------------------------------------------------
// end class SStatArray
//////////////////////////////////////////////////////////////////////////////
This version takes the name of the input file on the command line. Output to stdout can be redirected.
It has some VC++
specific code for handling the console attributes when running in a Command Prompt
box or from within the VC++ IDE.
/* DescStats.cpp 2-26-2009 Visual C++
Copyright (C)1997-99, 2009 Steven Whitney.
Published under GNU GPL (General Public License) Version 3, with ABSOLUTELY NO WARRANTY.
Utility program that uses a SStatArray to calculate statistics about data in a file.
*/
#include "stdafx.h"
#include "..\..\mylib\mylib.cpp"
#include "..\..\mylib\stataray.cpp"
using namespace std;
using namespace System;
int main(array<System::String ^> ^args)
{
// Set size of console window (optional)
//Console::SetWindowSize(160,60);
//Console::SetBufferSize(160,1000);
string filename("test.dat"); // this is the default input file during SStatArray development
if(args->Length < 1)
{
cout << endl;
cout << "Usage: DescStats FILENAME [> OUTFILENAME]" << endl << endl;
cout << "Calculates descriptive statistics about the set of numbers or ordered pairs in filename. ";
cout << "Use redirection to send the output to a file." << endl << endl;
//return(0);
}
else
{
SystemStringToBasicString(args[0],filename);
}
SStatArray s;
//s.SetConfidenceLevel(975);
s.ReadData(filename); // read the input data from the file
cout << endl;
s.WriteData(cout); // write the raw data back to console
cout << endl;
cout << s << endl; // write the stats to the console
cout << endl;
if(s.Count() > 1) // generate 24 predictions
{
cout << "Predictions: " << endl << endl;
double dx = s.RangeX() / (s.Count() - 1); // calculate average interval between x values
for(uint i = 0 ; i < 24 ; i++)
{
// when i==0, this is a prediction of the latest actual value of x
// beyond i==0, it is a prediction for subsequent values of x
double x = real(s.Last()) + (i * dx);
double nexty = s.PredictY(x);
cout << "#" << i << ": " << x << " -> " << nexty << endl;
}
}
cout << endl;
//Next two lines are required during testing from the IDE. Without it, the window closes before you can view the output.
Console::Write("Press <Enter> to close the window...");
Console::ReadLine();
return 0;
}
This version is more generic. It takes input from stdin, writes output to stdout.
In Linux, when entering input data to
console, type Ctrl+D to signal end of input data.
In Windows, you'd probably type Ctrl+Z.
/* DescStats.cpp 1-22-2010 GNU g++ or MSVC++
Copyright (C)1997-99, 2009, 2010 Steven Whitney.
Published under GNU GPL (General Public License) Version 3, with ABSOLUTELY NO WARRANTY.
Utility program that uses a SStatArray to calculate statistics about a dataset.
*/
#include "../include/mylib.cpp"
#include "../include/stataray.cpp"
using namespace std;
int main(int argc, char** argv)
{
cerr << "Calculates descriptive statistics about a set of numbers or ordered pairs." << endl;
cerr << "Example1: descstats [< infile] [> outfile]" << endl;
cerr << "Example2: sort -r < infile | descstats [> outfile]" << endl << endl;
SStatArray s;
//s.SetConfidenceLevel(975);
cin >> s;
cout << endl;
s.WriteData(cout); // echo the raw data
cout << endl;
cout << s << endl; // write the stats report
cout << endl;
if(s.Count() > 1) // generate 24 predictions
{
cout << "Predictions: " << endl << endl;
double dx = s.RangeX() / (s.Count() - 1); // calculate average interval between x values
for(uint i = 0 ; i < 24 ; i++)
{
// when i==0, this is a prediction of the latest actual value of x
// beyond i==0, it is a prediction for subsequent values of x
double x = real(s.Last()) + (i * dx);
double nexty = s.PredictY(x);
cout << "#" << i << ": " << x << " -> " << nexty << endl;
}
}
cout << endl;
return 0;
}
|
|
|