Any empirical data can be approximated by one of the Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343–414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153–60 1978). Thus, Pearson distributions make statistical analysis possible for data with unknown distributions. Both older printed tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the pearson system of curves in terms of β_{1} and β_{2}. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147–51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322–7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1–6 2009) are available for obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.). However, they are of little use in statistical analysis because we have to rely on unwieldy second difference interpolation to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.

Results

The present study develops a SAS/IML macro program that identifies the appropriate type of Pearson distribution from either an input dataset or the values of the four moments, and then computes and graphs probability values of Pearson distributions for any given percentage points.

Conclusions

The SAS macro program returns accurate approximations to Pearson distributions and can help researchers efficiently conduct statistical analysis on data with unknown distributions.

Background

Most statistical analyses rely on the assumption of normality, but this assumption is often difficult to meet in practice. A Pearson distribution can be fitted to any data using the first four moments of the data [1–3]. Thus, Pearson distributions make statistical analysis possible for any data with unknown distributions. For instance, in hypothesis testing, the sampling distribution of an observed test statistic is usually unknown, but it can be approximated by one of the Pearson distributions. We can then compute a p-value (or probability value) from the approximated Pearson distribution and use it to make a statistical decision in such distribution-free hypothesis testing.

There are both older printed tables [4] and contemporary computer programs [5–9] that provide a means of obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.). Unfortunately, they are of little use in statistical analysis because we have to employ unwieldy second difference interpolation for both skewness √β_{1} and kurtosis β_{2} to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing. Thus, a new program is needed for efficiently computing probability values of Pearson distributions for any given data point, so that researchers can conduct more widely applicable statistical analysis, such as distribution-free hypothesis testing, on data with unknown distributions.

Pearson distributions are a family of distributions consisting of seven different types of distributions plus the normal distribution (Table 1). To determine the type of Pearson distribution and the required parameters of the density function for the chosen type, we need only the first four moments of the data. Let X represent the given data; its first four central moments can be calculated by

$$ \mu_{1}=E(X),\qquad \mu_{i}=E\left[(X-\mu_{1})^{i}\right],\quad i=2,3,4. $$

(1)

The four central moments can also be uniquely determined by the mean, variance, skewness, and kurtosis, which are more commonly used parameters of a distribution and are easily obtained from statistical software. The relationships between skewness √β_{1} and the third central moment, and between kurtosis β_{2} and the fourth central moment, are as follows:

$$ \sqrt{\beta_{1}}=\frac{\mu_{3}}{\mu_{2}^{3/2}}\quad\left(\text{i.e., }\beta_{1}=\frac{\mu_{3}^{2}}{\mu_{2}^{3}}\right),\qquad \beta_{2}=\frac{\mu_{4}}{\mu_{2}^{2}}. $$

(2)
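As an illustrative aside, the moment calculations described above can be sketched in a few lines of Python. This is not the SAS/IML code of the macro: the function name `moment_summary` is hypothetical, and the sketch assumes population moments (dividing by n) and kurtosis reported as β_{2} rather than as excess kurtosis. Statistical packages may apply bias corrections or report β_{2} − 3 instead, so their output should be checked before it is used as input.

```python
def moment_summary(xs):
    """Return (mean, variance, signed skewness sqrt(beta1), kurtosis beta2).

    Assumption: population moments (divide by n), no bias correction.
    """
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    mu1 = sum(xs) / n
    # Second, third and fourth central moments about the mean.
    mu2, mu3, mu4 = (sum((x - mu1) ** k for x in xs) / n for k in (2, 3, 4))
    if mu2 == 0:
        raise ValueError("variance is zero; skewness and kurtosis are undefined")
    sqrt_beta1 = mu3 / mu2 ** 1.5   # carries the sign of mu3
    beta2 = mu4 / mu2 ** 2
    return mu1, mu2, sqrt_beta1, beta2
```

Keeping the sign on √β_{1} matters later, because the sign determines the direction of the skew in the fitted curve.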

Once the four central moments, or the mean, variance, skewness, and kurtosis, are calculated, the type of Pearson distribution by which X will be approximated can be determined by a κ-criterion that is defined as follows [1]:

$$ \kappa=\frac{\beta_{1}(\beta_{2}+3)^{2}}{4(4\beta_{2}-3\beta_{1})(2\beta_{2}-3\beta_{1}-6)}. $$

(3)
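A minimal Python sketch of the κ-criterion is given here for illustration only. It uses the standard form κ = β_{1}(β_{2}+3)^{2} / [4(4β_{2}−3β_{1})(2β_{2}−3β_{1}−6)], and the mapping from κ to type follows the usual Elderton and Johnson convention, which is an assumption here; Table 1 remains the authoritative statement of the criterion. The function names and the tolerance `tol` are hypothetical, and the treatment of the limiting cases (κ = 0, κ = 1, κ = ±∞) with a tolerance is a design choice of the sketch, not of the macro.

```python
import math

def kappa(beta1, beta2):
    """Pearson's kappa from beta1 (squared skewness) and beta2 (kurtosis)."""
    if beta1 == 0:
        return 0.0                      # symmetric case: numerator vanishes
    denom = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    if denom == 0:
        return math.inf                 # limiting case, treated as type III
    return beta1 * (beta2 + 3) ** 2 / denom

def pearson_type(beta1, beta2, tol=1e-9):
    """Classify by the kappa-criterion (assumed standard mapping)."""
    k = kappa(beta1, beta2)
    if math.isinf(k):
        return "III"
    if abs(k) <= tol:                   # kappa = 0: symmetric curves
        if abs(beta2 - 3) <= tol:
            return "normal"
        return "II" if beta2 < 3 else "VII"
    if k < 0:
        return "I"
    if abs(k - 1) <= tol:
        return "V"
    return "IV" if k < 1 else "VI"
```

In practice κ computed from data will almost never hit the boundary values exactly, so a real implementation has to decide how close is close enough; the tolerance above is only a placeholder for that decision.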

The determination of the type of Pearson distribution by the κ-criterion (Eq. 3) is illustrated in Table 1. From Table 1, we can also see that for each type of Pearson distribution, the density function has a closed form with a clearly defined domain of X. The closed form of the density functions makes numerical integration possible for obtaining probability values of the approximated Pearson distributions. For each type of Pearson distribution, the required parameters of the density function are calculated using different formulas. Without loss of generality, we illustrate the type IV formulas below. The formulas for the remaining types can be found in [1].

The density function for type IV Pearson distribution is

$$ y = y_{0}\left(1+\frac{(x-\lambda)^{2}}{a^{2}}\right)^{-m}e^{-\nu\tan^{-1}\left(\frac{x-\lambda}{a}\right)}, $$

(4)

where \(m=\frac{1}{2}(r+2)\), \(\nu=\frac{-r(r-2)\sqrt{\beta_{1}}}{\sqrt{16(r-1)-\beta_{1}(r-2)^{2}}}\), \(r=\frac{6(\beta_{2}-\beta_{1}-1)}{2\beta_{2}-3\beta_{1}-6}\), the scale parameter \(a=\sqrt{\mu_{2}/16}\,\sqrt{16(r-1)-\beta_{1}(r-2)^{2}}\), the location parameter \(\lambda=\mu_{1}+\nu a/r\), and the normalization coefficient \(y_{0}=\frac{N}{aF(r,\nu)}\).

The required parameters of the density function for each type of Pearson distribution are computed automatically by the SAS/IML [10] macro program described in the next section. Probability values of Pearson distributions can then be obtained through numerical integration with the SAS subroutine QUAD.
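To make the type IV calculation concrete, the following Python sketch computes the parameters listed under Eq. (4) and then integrates the density numerically. It is an illustration under stated assumptions, not the macro's implementation: it uses a plain composite Simpson rule instead of the SAS subroutine QUAD, it normalizes the density numerically instead of evaluating F(r, ν), and it takes the total frequency N as 1. The substitution x = λ + a tan θ maps the whole real line to (−π/2, π/2) and turns the density into a multiple of cos^{r}θ · e^{−νθ}, which avoids integrating over an infinite range. All function names are hypothetical.

```python
import math

def type4_params(mean, variance, skew, kurt):
    """Return (r, m, nu, a, lam) from the formulas listed under Eq. (4).

    skew is the signed sqrt(beta1); kurt is beta2 (not excess kurtosis).
    """
    beta1, beta2 = skew ** 2, kurt
    r = 6 * (beta2 - beta1 - 1) / (2 * beta2 - 3 * beta1 - 6)
    d = 16 * (r - 1) - beta1 * (r - 2) ** 2
    if d <= 0:
        raise ValueError("moments do not correspond to a type IV curve")
    m = (r + 2) / 2
    nu = -r * (r - 2) * skew / math.sqrt(d)
    a = math.sqrt(variance / 16) * math.sqrt(d)
    lam = mean + nu * a / r
    return r, m, nu, a, lam

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def type4_cdf(x0, mean, variance, skew, kurt):
    """Approximate P(X <= x0) for the fitted type IV curve."""
    r, m, nu, a, lam = type4_params(mean, variance, skew, kurt)
    # After x = lam + a*tan(theta), the density is proportional to this;
    # the exponent is r because 2m - 2 = r.
    g = lambda t: math.cos(t) ** r * math.exp(-nu * t)
    theta0 = math.atan((x0 - lam) / a)
    half_pi = math.pi / 2
    return simpson(g, -half_pi, theta0) / simpson(g, -half_pi, half_pi)
```

With zero skewness the curve is symmetric about its mean, so a call such as `type4_cdf(0.0, 0.0, 1.0, 0.0, 4.0)` should return a value very close to 0.5, which is a convenient sanity check.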

Implementation

To make the macro flexible, we allow two different ways to input the required information. The first is to input the dataset and variable; the macro then automatically calculates the mean, variance, skewness, and kurtosis of the input variable. The second is to input the mean, variance, skewness, and kurtosis of the variable directly. The main SAS/IML macro program (see Additional file 1) to compute and graph probability values of Pearson distributions is as follows: %PearsonProb(data=, var=, mean=, variance=, skew=, kurt=, x0=, plot=)

where data = the name of the dataset from which to calculate the four moments (this input can be omitted if the mean, variance, skewness, and kurtosis inputs are used); var = the name of the variable in the dataset from which to calculate the moments (this input can be omitted if the mean, variance, skewness, and kurtosis inputs are used); mean = the mean of the variable (this input can be omitted if the data and var inputs are used); variance = the variance of the variable (this input can be omitted if the data and var inputs are used); skew = the skewness of the variable (this input can be omitted if the data and var inputs are used); kurt = the kurtosis of the variable (this input can be omitted if the data and var inputs are used); x0 = the percentage point x_{0}; plot = 1 for graph, 0 for no graph.

This SAS/IML macro program has four steps. The first step is either to calculate the mean, variance, skewness, and kurtosis from the input dataset or to take the four values directly from the input parameters. The second step is to calculate κ using Eq. (3) and identify the specific type of Pearson distribution based on the κ-criterion displayed in Table 1. Once the type of Pearson distribution is determined, the third step is to calculate the parameters of the density function for that type. For example, for a type IV Pearson distribution, y_{0}, m, ν, a, and λ are calculated according to the specifications underneath Eq. (4). In the fourth and last step, the probability value of the specific type of Pearson distribution corresponding to the input percentage point x_{0} is calculated by numerical integration with the SAS subroutine QUAD. If the input x_{0} is beyond the defined domain, a warning message is printed, for example, “WARNING: x0 is out of the domain of type VI Pearson distribution.” If the calculation is successful, the computed probability value is printed along with the parameters (see Fig. 1).

To graph the probability value on the approximated density function of the Pearson distribution, a small SAS/IML macro %plotprob was written for use within the main SAS/IML macro %PearsonProb(data=, var=, mean=, variance=, skew=, kurt=, x0=, plot=). If 1 is input for plot, the SAS subroutines GDRAW, GPOLY, etc. are called in the small graphing macro to plot the density function and indicate the probability value. Otherwise (i.e., plot = 0), no graph is produced.

To illustrate the process, we provide an example of input and output below (two example datasets are available online: Additional files 2 & 3). One could either input a dataset and variable name (Item 1) or input the values of “mean”, “variance”, “skewness”, and “kurtosis” (Item 2) to the %PearsonProb macro. Both the dataset “dataIV” and the values of the four moments for this example are taken from [1].

The output from both statements is the same. The standard output (see Fig. 1) includes the values of the mean, variance, skewness, and kurtosis, and indicates the type of Pearson distribution identified. It also gives the formula for the density function and the values of its parameters. Lastly, it prints the calculated probability. Since we used the plot = 1 option, a figure illustrating the distribution and the probability is also produced (see Fig. 2).

Results

To evaluate the accuracy of the SAS/IML macro program for computing and graphing probability values of Pearson distributions, the calculated parameters of the approximated Pearson distributions from this SAS/IML macro were first compared with the corresponding ones in [1]. As can be seen in Table 2, the absolute differences between the calculated parameters from the SAS/IML macro and those from the tables in [1] are all very small, with almost all of them less than .001 and a few less than .019. The same holds for the relative differences, with an unsurprising exception (4.46%) for κ of type IV, whose original magnitude is very small.

Then, the computed probability values from the SAS/IML macro were evaluated using the percentage points in Table 32 of [4] (p. 276) corresponding to probability values of 2.5% and 97.5%, for illustration purposes only. From Table 3, we can see that the probability values computed from the SAS/IML macro are very close to .025 (or 2.5%) and .975 (or 97.5%), respectively, with differences of less than .0001.

Discussion

Pearson distributions are a family of non-parametric distributions. They are often used when the normal distribution assumption is not applicable to the data. In this paper, the first approach, inputting a dataset as the parameter for the macro, is the one more often used. The second approach, entering the first four moments as parameters, is more helpful when the researcher has already computed descriptive statistics for the data.

Conclusions

The new SAS/IML macro program provides an efficient and accurate means to determine the type of Pearson distribution based on either a dataset or the values of the first four moments, and then to compute probability values of the specific Pearson distribution. Thus, researchers can use this SAS/IML macro program to conduct distribution-free statistical analysis for any data with unknown distributions. The SAS/IML macro program can also graph the computed probability values on the Pearson distribution curves for visualization.

Availability and requirements

Project name: PearsonProb

Project home page: To be available

Operating system(s): Platform independent

Programming language: SAS/IML

Other requirements: SAS 9.4 or higher

License: Not applicable

Any restrictions to use by non-academics: None

Availability of data and materials

Not applicable.

References

Elderton WP, Johnson NL. Systems of Frequency Curves. London: Cambridge University Press; 1969.

Pearson K. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philos Trans R Soc Lond Ser A. 1895; 186:343–414.

Amos DE, Daniel SL. Tables of percentage points of standardized Pearson distributions, Research Report SC-RR-71 0348. Albuquerque: Sandia Laboratories; 1971.

Bouver H, Bargmann RE. Tables of the standardized percentage points of the Pearson system of curves in terms of β_{1} and β_{2}, Technical Report No. 107. Georgia: Department of Statistics and Computer Science, University of Georgia; 1974.

QY extensively revised the manuscript and the SAS program. XA revised the manuscript. WP initially wrote the manuscript and the SAS program. All authors read and approved the final manuscript.

SAS/IML macro program. The SAS/IML macro program for computing and graphing probability values of Pearson distributions is available as an additional file, PearsonDistributionProbfinal.sas

Sample dataset 2. The dataset dataIV.sas7bdat was taken from [1].

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Yang, Q., An, X. & Pan, W. Computing and graphing probability values of Pearson distributions: a SAS/IML macro. Source Code Biol Med 14, 6 (2019). https://doi.org/10.1186/s13029-019-0076-2