Computing and graphing probability values of Pearson distributions: a SAS/IML macro

Background: Any empirical data can be approximated by one of the Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343–414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153–60 1978). Thus, Pearson distributions make statistical analysis possible for data with unknown distributions. There are both extant, old-fashioned in-print tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the pearson system of curves in terms of β1 and β2. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147–51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322–7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1–6 2009) available for obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.), but they are of little use in statistical analysis because we have to rely on unwieldy second difference interpolation to calculate a probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.

Results: The present study develops a SAS/IML macro program that identifies the appropriate type of Pearson distribution based on either an input dataset or the values of four moments, and then computes and graphs probability values of Pearson distributions for any given percentage points.

Conclusions: The SAS macro program returns accurate approximations to Pearson distributions and can efficiently help researchers conduct statistical analysis on data with unknown distributions.


Background
Most statistical analysis relies on the assumption of normal distributions, but this assumption is often difficult to meet in reality. Pearson distributions can be approximated for any data using the first four moments of the data [1][2][3]. Thus, Pearson distributions make statistical analysis possible for any data with unknown distributions. For instance, in hypothesis testing, the sampling distribution of an observed test statistic is usually unknown, but it can be fitted to one of the Pearson distributions. Then, we can compute and use a p-value (or probability value) of the approximated Pearson distribution to make a statistical decision for such distribution-free hypothesis testing.

*Correspondence: wei.pan@duke.edu; Duke University, 27710, Durham, USA
There are both extant, old-fashioned in-print tables [4] and contemporary computer programs [5][6][7][8][9] that provide a means of obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.). Unfortunately, they are of little use in statistical analysis because we have to employ unwieldy second difference interpolation for both skewness $\sqrt{\beta_1}$ and kurtosis $\beta_2$ to calculate a probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.
Thus, a new program is needed for efficiently computing probability values of Pearson distributions for any given data point, so that researchers can use it to conduct more applicable statistical analysis, such as distribution-free hypothesis testing, on data with unknown distributions.

Pearson distributions are a family of distributions consisting of seven different types plus the normal distribution (Table 1). To determine the type of Pearson distribution and the required parameters of the density function for the chosen type, we only need to know the first four moments of the data. Let X represent the given data; its first four central moments can be calculated by

\[ \mu'_1 = E(X); \qquad \mu_i = E\left[(X - \mu'_1)^i\right], \quad i = 2, 3, 4. \tag{1} \]

The four central moments can also be uniquely determined by the mean, variance, skewness, and kurtosis, which are more commonly used parameters for a distribution and are easily obtained from statistical software. The relationships between skewness $\sqrt{\beta_1}$ and the third central moment, and between kurtosis $\beta_2$ and the fourth central moment, are as follows:

\[ \sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}}; \qquad \beta_2 = \frac{\mu_4}{\mu_2^{2}}. \tag{2} \]

Once the four central moments or the mean, variance, skewness, and kurtosis are calculated, the type of Pearson distribution to which X will be approximated can be determined by a κ-criterion that is defined as follows [1]:

\[ \kappa = \frac{\beta_1 (\beta_2 + 3)^2}{4\,(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}. \tag{3} \]

The determination of the types of Pearson distributions by the κ-criterion (Eq. 3) is illustrated in Table 1. From Table 1, we can also see that for each type of Pearson distribution, the density function has a closed form with a clearly defined domain of X. The closed form of the density functions makes numerical integration possible for obtaining probability values of approximated Pearson distributions. For each type of Pearson distribution, the required parameters of the density function are calculated using different formulas. Without loss of generality, we illustrate the type IV formula below. The formulas for the remaining types can be retrieved from [1].
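The moment-to-type step described above can be sketched in a few lines. The sketch below is written in Python rather than SAS/IML, so it is not the macro's own code; the function names `sample_moments` and `pearson_type` are hypothetical, the sample moments are assumed to be the simple divide-by-n versions, and the type boundaries are assumed to follow the usual κ-criterion table in [1] (κ < 0: type I; κ = 0: normal, II, or VII depending on $\beta_2$; 0 < κ < 1: type IV; κ = 1: type V; κ > 1: type VI; κ infinite: type III).

```python
import math

def sample_moments(xs):
    """Mean, variance, skewness sqrt(beta1), and kurtosis beta2 of a sample.

    Assumption: population (divide-by-n) moments, with kurtosis not
    centered at zero, so a normal sample gives beta2 close to 3.
    """
    n = len(xs)
    mean = sum(xs) / n
    mu2, mu3, mu4 = (sum((x - mean) ** k for x in xs) / n for k in (2, 3, 4))
    return mean, mu2, mu3 / mu2 ** 1.5, mu4 / mu2 ** 2

def pearson_type(skew, kurt, tol=1e-8):
    """Return (kappa, type label) from skewness sqrt(beta1) and kurtosis beta2."""
    beta1, beta2 = skew ** 2, kurt
    if beta1 < tol:
        # Symmetric case: kappa = 0, and beta2 decides among normal, II, VII.
        if abs(beta2 - 3.0) < tol:
            return 0.0, "normal"
        return 0.0, ("II" if beta2 < 3.0 else "VII")
    denom = 4.0 * (4.0 * beta2 - 3.0 * beta1) * (2.0 * beta2 - 3.0 * beta1 - 6.0)
    if abs(denom) < tol:
        # kappa is infinite on the line 2*beta2 - 3*beta1 - 6 = 0.
        return math.inf, "III"
    kappa = beta1 * (beta2 + 3.0) ** 2 / denom
    if kappa < 0.0:
        return kappa, "I"
    if abs(kappa - 1.0) < tol:
        return kappa, "V"
    return kappa, ("IV" if kappa < 1.0 else "VI")
```

For instance, with the hypothetical values skewness 0.5 and kurtosis 4, `pearson_type(0.5, 4.0)` gives κ ≈ 0.161 and therefore type IV under the assumed table. Note that statistical packages often report bias-corrected skewness and excess kurtosis, which would need converting before being passed to such a function.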
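The density function for the type IV Pearson distribution is

\[ y = y_0 \left[ 1 + \left( \frac{x - \lambda}{a} \right)^2 \right]^{-m} \exp\left[ -\nu \tan^{-1}\left( \frac{x - \lambda}{a} \right) \right], \qquad -\infty < x < \infty, \tag{4} \]

where $y_0$ is the normalizing constant, $m$ and $\nu$ are shape parameters, $a$ is the scale parameter, and $\lambda$ is the location parameter. The required parameters of the density function for each type of Pearson distribution are computed automatically in a SAS/IML [10] macro program described in the next section. Then, probability values of Pearson distributions can be obtained through numerical integration with the SAS subroutine QUAD.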
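For reference, the type IV parameters can be written in terms of the moments as follows. This is a sketch that assumes the standard parametrization of Elderton and Johnson [1] for a density of the form $y_0 [1 + ((x-\lambda)/a)^2]^{-m} \exp[-\nu \tan^{-1}((x-\lambda)/a)]$; the auxiliary quantity $r$ and the arrangement of the formulas are assumptions here and may differ from the macro's own specification. The sign of $\sqrt{\beta_1}$ is taken to be the sign of $\mu_3$.

```latex
\begin{aligned}
r &= \frac{6(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6}, \qquad
m = \frac{r + 2}{2}, \\
\nu &= -\frac{r (r - 2) \sqrt{\beta_1}}{\sqrt{16(r - 1) - \beta_1 (r - 2)^2}}, \\
a &= \frac{\sqrt{\mu_2 \left[ 16(r - 1) - \beta_1 (r - 2)^2 \right]}}{4}, \\
\lambda &= \mu'_1 - \frac{(r - 2) \sqrt{\beta_1} \sqrt{\mu_2}}{4}, \\
y_0 &= \left[ a \int_{-\pi/2}^{\pi/2} \cos^{r}\theta \; e^{-\nu\theta} \, d\theta \right]^{-1}.
\end{aligned}
```

As a check with hypothetical values $\mu'_1 = 0$, $\mu_2 = 1$, $\sqrt{\beta_1} = 0.5$, and $\beta_2 = 4$: $r = 13.2$, $m = 7.6$, $16(r-1) - \beta_1 (r-2)^2 = 163.84$, so $\nu = -5.775$, $a = 3.2$, and $\lambda = -1.4$. The implied mean $\lambda - a\nu/r = -1.4 + 1.4 = 0$ recovers the input mean.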

Implementation
To add flexibility to the macro, we allow two different ways to input the required information. The first is to input the dataset and variable; the macro will then automatically calculate the mean, variance, skewness, and kurtosis of the input variable. The second is to input the mean, variance, skewness, and kurtosis of the variable directly. In both cases, the macro also takes x0 = the percentage point $x_0$, and plot = 1 for graph, 0 for no graph.

This SAS/IML macro program has four steps. The first step is to either calculate the mean, variance, skewness, and kurtosis from the input dataset or take the four values directly from the input parameters. The second step is to calculate κ using Eq. (3) and identify the specific type of Pearson distribution based on the κ-criterion displayed in Table 1. Once the type of Pearson distribution is determined, in the third step, the macro calculates the parameters of the density function for that type. For example, for the type IV Pearson distribution, $y_0$, $m$, $\nu$, $a$, and $\lambda$ are calculated according to the specifications underneath Eq. (4). In the fourth and last step, the probability value of the specific type of Pearson distribution corresponding to the input percentage point $x_0$ is calculated by numerical integration with the SAS subroutine QUAD. If the input $x_0$ is beyond the defined domain, a warning message is printed, for example, "WARNING: x0 is out of the domain of type VI Pearson distribution". If successful, the computed probability value is printed along with the parameters (see Fig. 1).
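The third and fourth steps can be illustrated for type IV with a minimal sketch. It is written in Python rather than SAS/IML and is not the macro's code: the function names are hypothetical, the parameter formulas assume the standard parametrization in [1], and a plain Simpson rule stands in for the QUAD subroutine. The substitution $x = \lambda + a \tan\theta$ is an assumption of this sketch that turns the integral over $(-\infty, x_0]$ into one over a finite interval.

```python
import math

def type4_params(mean, variance, skew, kurt):
    """Type IV parameters (r, m, nu, a, lam) from the four summary values."""
    b1, b2 = skew ** 2, kurt
    r = 6.0 * (b2 - b1 - 1.0) / (2.0 * b2 - 3.0 * b1 - 6.0)
    disc = 16.0 * (r - 1.0) - b1 * (r - 2.0) ** 2   # positive for type IV
    m = (r + 2.0) / 2.0
    nu = -r * (r - 2.0) * skew / math.sqrt(disc)
    a = math.sqrt(variance * disc) / 4.0
    lam = mean - (r - 2.0) * skew * math.sqrt(variance) / 4.0
    return r, m, nu, a, lam

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def type4_prob(x0, mean, variance, skew, kurt):
    """Return (P(X <= x0), y0) for the fitted type IV density."""
    r, m, nu, a, lam = type4_params(mean, variance, skew, kurt)

    # With x = lam + a*tan(t), the density times dx becomes
    # y0 * a * cos(t)**r * exp(-nu*t) dt on (-pi/2, pi/2).
    def g(t):
        # max(..., 0) guards against a tiny negative cosine at the endpoints.
        return max(math.cos(t), 0.0) ** r * math.exp(-nu * t)

    full = simpson(g, -math.pi / 2, math.pi / 2)
    part = simpson(g, -math.pi / 2, math.atan((x0 - lam) / a))
    return part / full, 1.0 / (a * full)
```

With the hypothetical inputs mean 0, variance 1, skewness 0.5, and kurtosis 4, `type4_params` returns r = 13.2, m = 7.6, ν = −5.775, a = 3.2, and λ = −1.4, and `type4_prob(x0, 0.0, 1.0, 0.5, 4.0)` returns the lower-tail probability at `x0` together with the normalizing constant. Because type IV is defined on the whole real line, no domain check is needed here; the bounded types would need one before integrating.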
To graph the probability value on the approximated density function of the Pearson distribution, a small SAS/IML macro %plotprob was written for use within the main SAS/IML macro %PearsonProb(data=, var=, mean=, variance=, skew=, kurt=, x0=, plot=). If 1 is input for plot, the SAS subroutines GDRAW, GPOLY, etc. are called in the small graphing macro to plot the density function and indicate the probability value. Otherwise (i.e., plot = 0), no graph is produced.
To illustrate the process, we provide an example of input and output below (two example datasets are available online: Additional files 2 & 3). One could either input a dataset and variable name (Item 1) or input the values of "mean", "variance", "skewness", and "kurtosis" (Item 2) to the %PearsonProb macro. Both the dataset "dataIV" and the values of the four moments for this example are taken from [1]. The outputs from both statements are the same. The standard output (see Fig. 1) includes the values of the mean, variance, skewness, and kurtosis, and indicates the type of Pearson distribution identified. It also outputs the formula for the density function and the values of the parameters of the density function. Lastly, it prints the calculated probability. Since we used the plot = 1 option, a figure illustrating the distribution and the probability is also produced (see Fig. 2).

Results
To evaluate the accuracy of the SAS/IML macro program for computing and graphing probability values of Pearson distributions, the parameters of the approximated Pearson distributions calculated by the macro were first compared with the corresponding ones in [1]. As can be seen in Table 2, the absolute differences between the parameters calculated by the SAS/IML macro and those from the tables in [1] are all very small: almost all are less than .001, and the few remaining ones are less than .019. The same holds for the relative differences, with the unsurprising exception (4.46%) of κ for type IV, whose original magnitude is very small. Then, the probability values computed by the SAS/IML macro were evaluated using the percentage points in Table 32 of [4] (p. 276) corresponding to probability values of 2.5% and 97.5%, for illustration purposes only. From Table 3, we can see that the probability values computed by the SAS/IML macro are very close to .025 (or 2.5%) and .975 (or 97.5%), respectively, with differences of less than .0001.

Discussion
Pearson distributions are a family of non-parametric distributions. They are often used when the normal distribution assumption is not applicable to the data. In this paper, the first approach, inputting a dataset as the parameter for the macro, is the more commonly used one. The second approach, entering the first four moments as parameters, is more helpful when the researcher has already computed descriptive statistics from the data.

Conclusions
The new SAS/IML macro program provides an efficient and accurate means to determine the type of Pearson distribution based on either a dataset or the values of the first four moments, and then to compute probability values of the specific Pearson distribution. Thus, researchers can use this SAS/IML macro program to conduct distribution-free statistical analysis for any data with unknown distributions. The program can also graph the computed probability values on the Pearson distribution curves, which makes them easier to visualize.