DISTRIBUTION directive

Estimates the parameters of continuous and discrete distributions.


Options

PRINT = strings
Printed output required from each individual fit (parameters, samplestatistics, fittedvalues, proportions, monitoring); default para, samp, fitt

CBPRINT = strings
Printed output required from a fit combining all the input data (parameters, samplestatistics, fittedvalues, proportions, monitoring); default *

DISTRIBUTION = string
Distribution to be fitted (Poisson, geometric, logseries, negativebinomial, NeymanA, PolyaAeppli, PlogNormal, PPascal, Normal, dNvequal, dNvunequal, logNormal, exponential, gamma, Weibull, extremevalue, b1, b2, Pareto); default * i.e. fit nothing

CONSTANT = string
Whether to estimate a location parameter for the gamma, logNormal, Pareto or Weibull distributions (estimate, omit); default omit

LIMITS = variate
Variate to specify or save upper limits for classifying the data into groups; default *

NGROUPS = scalar
When LIMITS is not specified, this defines the number of groups (of approximately equal size) into which the data are to be classified; default is the integer value nearest to the square root of the number of data values

XDEVIATES = variate
Variate to specify points up to which the CUMPROPORTIONS are to be estimated

JOINT = string
Requests joint estimates from the combined fit to be used for a re-fit to the separate data sets (dispersion, variancemeanratio, Poissonindex); default *

PARAMETERS = variate
Estimated parameters from the combined fit

SE = variate
Standard errors for the estimated parameters of the combined fit

VCOVARIANCE = symmetric matrix
Variance-covariance matrix for the estimated parameters of the combined fit

CUMPROPORTIONS = variate
Estimated cumulative proportions of the combined distribution up to the values specified by the XDEVIATES option

MAXCYCLE = scalar
Maximum number of iterations; default 30

TOLERANCE = scalar
Convergence criterion; default 0.0001


Parameters

DATA = variates or tables
Data values either classified (table) or unclassified (variate)

NOBSERVATIONS = tables
One-way table to save the data classified into groups

RESIDUALS = tables
Residuals from each (individual) fit

FITTEDVALUES = tables
Fitted values from each fit

PARAMETERS = variates
Estimated parameters from each fit

SE = variates
Standard errors of the estimates

VCOVARIANCE = symmetric matrices
Variance-covariance matrix for each set of estimated parameters

CUMPROPORTIONS = variates
Estimated cumulative proportions of each distribution up to the values specified by the XDEVIATES option

CBRESIDUALS = tables
Residuals from the combined fit

CBFITTEDVALUES = tables
Fitted values from the combined fit

STEPLENGTH = variates
Initial step lengths for each fit

INITIAL = variates
Initial values for each fit


Description

The DISTRIBUTION directive is used to fit a theoretical distribution function to an observed sample of data, in order to obtain maximum-likelihood estimates of the parameters of the distribution and to test the goodness of fit. The data consist of observations x_i of a random variable X, which has a distribution function F(x) defined by F(x) = Pr(X ≤ x). A selection of both discrete and continuous distributions is available; full details are given below.

   For discrete distributions X may take non-negative integer values only, except for the log-series distribution where only positive integer values are allowed. For continuous distributions the random variable X may take any values, subject to constraints for certain distributions; for example, data values must be strictly positive in order to fit a log-Normal distribution. Constraints are detailed with the individual distributions described below.

   The data can be supplied to DISTRIBUTION as a variate or as a one-way table of counts. If the raw data are available, then these should be supplied (as a variate), since the raw data contain more information than grouped data.

   If raw data are not available, then a one-way table of counts, or frequencies, should be given. The factor classifying the table must have its levels vector declared explicitly, since the levels are used to indicate the boundary values of the raw data used to create the grouping. For example, if the discrete variable X takes the values 0...8, with numbers of observations 2,6,7,4,2,1,0,1,0 respectively, a table of counts can be declared by

FACTOR [LEVELS=!(0...8)] F

TABLE [CLASSIFICATION=F; VALUES=2,6,7,4,2,1,0,1,0] T

The factor levels do not have to specify single data values: often it will be desirable to group certain values together, and indeed for continuous data this is the only sensible way to proceed. In general, for a classifying factor with levels l_1, l_2, ... , l_f, the count n_k for the kth cell of the table will be the number of observations x_i such that

    x_i ≤ l_1,                   k = 1

    l_{k-1} < x_i ≤ l_k,         2 ≤ k ≤ f-1

    l_{f-1} < x_i,               k = f

This means that for all except the last cell of the table, the factor level represents the upper limit on values in that cell. The final class of the table is termed the tail; it is formed by combining the frequencies for all values of X greater than l_{f-1}, and the upper limit on values in the tail is infinity. For continuous distributions with no lower bound, the first class will be the lower tail. You will often want to form the tail(s) by amalgamating groups with low numbers of counts. In the example above, you might amalgamate the groups for values 6-8:

FACTOR [LEVELS=!(0...5,99)] F2

TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

Note that the final factor level, for the tail, can be given a dummy value of 99 to indicate that it has no upper limit, since this value is never used in calculations.
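The classification rule above can be checked outside Genstat. The following is a minimal Python sketch, not part of the directive itself: the function name classify and the variable raw are illustrative assumptions, and the raw values are simply one data set that reproduces the counts in the example.

```python
from bisect import bisect_left

def classify(data, limits):
    """Count observations per cell, where limits are the factor levels
    l_1..l_f (upper limits). The last level is a dummy: the final cell
    is the open-ended tail, so its value is never used."""
    inner = limits[:-1]                  # l_1 .. l_{f-1}
    counts = [0] * len(limits)
    for x in data:
        # bisect_left gives the number of limits strictly less than x,
        # which is the 0-based index of the cell with l_{k-1} < x <= l_k
        counts[bisect_left(inner, x)] += 1
    return counts

# Hypothetical raw values giving counts 2,6,7,4,2,1,0,1,0 for X = 0...8
raw = [0] * 2 + [1] * 6 + [2] * 7 + [3] * 4 + [4] * 2 + [5] * 1 + [7] * 1

counts_full = classify(raw, list(range(9)))          # levels 0...8
counts_tail = classify(raw, [0, 1, 2, 3, 4, 5, 99])  # values 6-8 amalgamated
print(counts_full)
print(counts_tail)
```

Changing the dummy level 99 to any other value larger than 5 leaves counts_tail unchanged, which is the sense in which the final level is never used.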

   When data are supplied as a table instead of as a variate, the computed log-likelihood is only an approximation to the full log-likelihood and the solution obtained will depend to some extent on the choice of class limits. More reliable results will be achieved with a larger number of classes, since this gives more information on the data distribution, so only classes with very few observations should be amalgamated. In general, care should be taken to choose class limits that give a reasonable number of counts in each class, but with none of the individual classes holding a disproportionately large number of observations.

   The DISTRIBUTION option should be set to indicate which distribution is to be fitted to the data. The following distributions are available:

Discrete                                                 Continuous

Binomial (as a special case of the negative binomial)    Normal
Poisson                                                  Double Normal (equal variances)
Geometric                                                Double Normal (unequal variances)
Log-series                                               Log-Normal
Negative binomial                                        Exponential
Neyman type A                                            Gamma
Pólya-Aeppli                                             Weibull
Poisson-log-Normal                                       Extreme value
Poisson-Pascal                                           Beta type I and type II
                                                         Pareto

   The first step of the fitting process is to compute and print various sample statistics. Examining these may help in the selection of appropriate distributions for fitting - properties of the various distributions are listed at the end of this section. The setting DISTRIBUTION=* can be used to produce this output without any model fitting. The following sample statistics are calculated:

Sample size                n

Sample mean                m = Σ x_i / n

Sample variance            s^2 = Σ x_i^2 / n - m^2                           discrete distributions
                           s^2 = Σ (x_i - m)^2 / (n-1)                       continuous distributions

Sample skewness            g_1 = Σ (x_i - m)^3 / ((n-1) s^3) = m_3 / s^3

Sample kurtosis            g_2 = Σ (x_i - m)^4 / ((n-1) s^4) - 3             continuous distributions only

Sample quartiles           x_p: F(x_p) = p

Poisson index              (s^2 - m) / m^2                                   discrete distributions only

Negative binomial index    m (m_3 - 3 s^2 + 2m) / (s^2 - m)^2                discrete distributions only
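
As a check on the formulae above, the statistics can be computed directly from raw data. The Python sketch below is illustrative only: the function name sample_statistics and the dictionary keys are assumptions, and m_3 is taken to be Σ (x_i - m)^3 / (n-1), as implied by the two forms given for the skewness. Quartiles are omitted.

```python
import math

def sample_statistics(x, discrete):
    """Sample statistics as defined in the table above (quartiles omitted)."""
    n = len(x)
    m = sum(x) / n
    if discrete:
        s2 = sum(v * v for v in x) / n - m * m           # divisor n
    else:
        s2 = sum((v - m) ** 2 for v in x) / (n - 1)      # divisor n-1
    s = math.sqrt(s2)
    m3 = sum((v - m) ** 3 for v in x) / (n - 1)          # assumed definition of m_3
    stats = {"n": n, "mean": m, "variance": s2, "skewness": m3 / s ** 3}
    if discrete:
        stats["poisson_index"] = (s2 - m) / m ** 2
        # Undefined when the variance equals the mean exactly
        stats["negbin_index"] = m * (m3 - 3 * s2 + 2 * m) / (s2 - m) ** 2
    else:
        stats["kurtosis"] = (
            sum((v - m) ** 4 for v in x) / ((n - 1) * s ** 4) - 3
        )
    return stats

# Hypothetical discrete sample: values 0...8 with counts 2,6,7,4,2,1,0,1,0
raw = [0] * 2 + [1] * 6 + [2] * 7 + [3] * 4 + [4] * 2 + [5] * 1 + [7] * 1
stats_d = sample_statistics(raw, discrete=True)
stats_c = sample_statistics([1.0, 2.0, 3.0, 4.0, 5.0], discrete=False)
print(stats_d)
print(stats_c)
```

A positive Poisson index indicates a variance larger than the mean, which may suggest trying one of the over-dispersed discrete distributions rather than the Poisson.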


If the original data are not available, the sample statistics are calculated by substituting class mid-points in place of the data. For the lower tail, the class "mid-point" is taken to be l_1 - ½(l_2 - l_1) and for the upper tail, l_{f-1} + ½(l_{f-1} - l_{f-2}). No corrections are made for groupings. When a distribution has been fitted to data, the relevant theoretical statistics of that distribution are printed for comparison with the sample statistics, as a check on the appropriateness of the model for the data.
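
The tail mid-point rules can be illustrated with a short Python sketch for continuous data with both a lower and an upper tail. The function name class_midpoints, the limits and the counts are all hypothetical, and the interior mid-point is assumed to be the average of the two adjacent limits, which the text above does not state explicitly.

```python
def class_midpoints(limits):
    """Mid-points for classes with upper limits l_1..l_f (at least three),
    where the first class is a lower tail and the last an upper tail.
    The value of l_f is a dummy and is not used."""
    f = len(limits)
    lower = limits[0] - 0.5 * (limits[1] - limits[0])
    # Assumption: interior classes use the average of adjacent limits
    interior = [0.5 * (limits[k - 1] + limits[k]) for k in range(1, f - 1)]
    upper = limits[f - 2] + 0.5 * (limits[f - 2] - limits[f - 3])
    return [lower] + interior + [upper]

limits = [10, 20, 30, 40, 99]       # hypothetical class limits; 99 is a dummy
counts = [3, 5, 8, 3, 1]            # hypothetical grouped counts
mids = class_midpoints(limits)
grouped_mean = sum(c * x for c, x in zip(counts, mids)) / sum(counts)
print(mids, grouped_mean)
```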

   A summary is given of the fit: the parameter estimates are printed with their standard errors and correlations, including the working parameters, which are stable functions of the parameters defining the distribution and are used in the internal algorithm. The goodness of fit to the chosen distribution is indicated by the residual deviance, which has an asymptotic chi-square distribution with the specified degrees of freedom. The deviance is also the preferred statistic for comparison of nested models, for example the double Normal distribution with equal and unequal variances. This is followed by a table of observed and fitted values (expected frequencies), together with weighted residuals. If raw data are supplied, by default this table is formed by dividing the data into √n groups of approximately equal observed frequency, which are therefore likely to be of unequal widths. The NGROUPS option may be used to set the number of groups for this table. If data are supplied as a table, the fitted values use the classification from that table. In either case the LIMITS option may be used to supply a different set of limits, with the constraint that, if tabulated data are analysed, these limits should be a subset of the original limits so that the new groups are formed by aggregation.
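
The idea of forming about √n groups of roughly equal observed frequency can be sketched in Python as follows. This is only one plausible way of choosing such limits, based on order statistics; it is an assumption for illustration and is not claimed to reproduce the limits that DISTRIBUTION itself chooses. The function name default_limits is hypothetical.

```python
import math
from bisect import bisect_left

def default_limits(data, ngroups=None):
    """Upper limits for all groups except the tail, chosen so that the
    groups hold roughly equal numbers of observations."""
    xs = sorted(data)
    n = len(xs)
    if ngroups is None:
        ngroups = int(math.sqrt(n) + 0.5)    # nearest integer to sqrt(n)
    limits = []
    for k in range(1, ngroups):
        value = xs[math.ceil(k * n / ngroups) - 1]
        if not limits or value > limits[-1]:  # skip duplicates caused by ties
            limits.append(value)
    return limits

data = list(range(1, 17))                     # hypothetical sample, n = 16
limits = default_limits(data)
counts = [0] * (len(limits) + 1)              # final cell is the tail
for x in data:
    counts[bisect_left(limits, x)] += 1       # cell k holds l_{k-1} < x <= l_k
print(limits, counts)
```

With tied data values the groups cannot always be made equal, so some cells may hold noticeably more observations than others.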

   The NOBSERVATIONS, FITTEDVALUES and RESIDUALS parameters can be used to save the number of observations in each cell, the fitted number, and the residual respectively (all in tables). The parameter estimates and their standard errors can be saved in variates specified by PARAMETERS and SE. The variance-covariance matrix for the estimated parameters can be saved as a symmetric matrix using the VCOVARIANCE parameter.

   Having fitted the required distribution, the estimated cumulative distribution function (CDF) can be evaluated at specified values of X. These are defined using the XDEVIATES option. The values of the CDF can be printed (by selecting PRINT=proportions) or saved in a variate by setting the CUMPROPORTIONS parameter.
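
To show what evaluating the estimated CDF at a set of points involves, the Python sketch below uses the Poisson distribution, whose maximum-likelihood estimate from ungrouped data is the sample mean. It is an illustration under that assumption, not a reproduction of DISTRIBUTION output; the names poisson_cdf, xdeviates and cumprop are hypothetical.

```python
import math

# Counts for X = 0...8 from the table example earlier in this section
values = list(range(9))
counts = [2, 6, 7, 4, 2, 1, 0, 1, 0]
n = sum(counts)
mean = sum(v * c for v, c in zip(values, counts)) / n   # Poisson ML estimate

def poisson_cdf(x, mu):
    """F(x) = Pr(X <= x) for a Poisson distribution with mean mu."""
    if x < 0:
        return 0.0
    top = int(math.floor(x))
    return sum(math.exp(-mu) * mu ** k / math.factorial(k)
               for k in range(top + 1))

xdeviates = [0, 2, 5]                 # points at which the CDF is wanted
cumprop = [poisson_cdf(x, mean) for x in xdeviates]
print(mean, cumprop)
```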

   If you have several sets of data you may be interested in fitting the distribution individually to each set; this can be done by setting the DATA parameter to a list of identifiers. A separate analysis is then performed for each set of data, but any option settings are common to all the data sets. The data sets should all be specified in the same way, either as raw data or as tabulated counts. For tabulated counts, the same categories must be used for defining every table. You can also carry out one final fit to the combined data set, in order to investigate whether the data can be adequately modelled as coming from a single population. This combined fit is produced if any of the options relating to the combined fit have been set (that is, options CBPRINT, PARAMETERS, SE, VCOVARIANCE or CUMPROPORTIONS, which print or save information from the combined analysis). For each individual data set you can also save residuals and fitted values based on the parameters estimated from the combined data set, using the CBRESIDUALS and CBFITTEDVALUES parameters. The JOINT option can be used to specify that certain parameters should be held constant at their estimated values from the combined analysis during refits to the individual data sets. For continuous distributions only, a common dispersion parameter can be requested; for discrete distributions a common value can be requested for either the Poisson index or the ratio of variance to mean. An analysis of deviance is printed to compare the nested models.

   If the original data are available, the full log-likelihood is used in the optimization algorithm. Otherwise, an approximate log-likelihood is optimized, using representative values for each class. For some distributions, it is necessary to use stable working parameters in the optimization algorithm (Ross 1990), and the defining parameters for the distribution are then evaluated by a simple transformation.

   The deviance and corresponding degrees of freedom that are printed as part of the model summary are based on the table of fitted values, and thus may be affected by the choice of limits. The residuals computed are deviance residuals (McCullagh & Nelder 1989), and the deviance is therefore the sum of squared residuals. The degrees of freedom are n-p-1, where n is the number of cells in the table of fitted values and p is the number of parameters estimated in the model. The default limits for grouping the raw data are designed to avoid small expected frequencies (for example in the tail cells), which can have an inflationary effect on the deviance; however, if the tails are important, because of the origin of the data, it may be important to specify the limits explicitly.
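
The relationship between deviance residuals, the deviance and the degrees of freedom can be sketched in Python. The residual formula used here is the usual Poisson-type deviance residual for counts, which is an assumption: the text above does not give the exact expression used by DISTRIBUTION. The fitted values are generated from a Poisson distribution with mean 52/23 purely for illustration and are not output from the directive.

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

def deviance_residual(obs, fit):
    """Signed square root of 2*(o*log(o/e) - (o - e)); assumed form."""
    term = obs * math.log(obs / fit) if obs > 0 else 0.0
    d2 = 2.0 * (term - (obs - fit))
    return math.copysign(math.sqrt(max(d2, 0.0)), obs - fit)

observed = [2, 6, 7, 4, 2, 1, 1]        # cells 0,1,...,5 and the tail (>5)
total = sum(observed)
mu = 52 / 23                            # illustrative Poisson mean
fitted = [total * poisson_pmf(k, mu) for k in range(6)]
fitted.append(total - sum(fitted))      # expected frequency in the tail

residuals = [deviance_residual(o, e) for o, e in zip(observed, fitted)]
deviance = sum(r * r for r in residuals)

n_params = 1                            # p: one parameter for the Poisson
df = len(observed) - n_params - 1       # n - p - 1 with n = 7 cells
print(deviance, df)
```

Because the observed and fitted totals agree, the deviance here reduces to 2 Σ o log(o/e) over the cells with non-zero counts.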

   An iterative Gauss-Newton optimization method is used to estimate the parameters of the distribution. The parameterization is chosen for each model so that the optimization is stable, but if there are any problems with particular data sets it may be necessary to control this process. The MAXCYCLE and TOLERANCE options allow you to increase the number of iterations and alter the convergence criterion for data sets that fail to converge. You can also specify initial values and step lengths for the parameters for each set of data using the INITIAL and STEPLENGTH parameters. These parameters should be set to variates of length appropriate for the distribution being fitted; for example, if DISTRIBUTION=Poisson they should have just one value. Another use of INITIAL and STEPLENGTH is to constrain a parameter to a particular value; for example, when fitting a double Normal, the proportion parameter p could be fixed at 0.5 by setting the initial value to 0.5 and the step length to 0, thus fitting a double Normal in equal proportions. Note that the degrees of freedom are not adjusted to take account of this.

 

Options: PRINT, CBPRINT, DISTRIBUTION, CONSTANT, LIMITS, NGROUPS, XDEVIATES, JOINT, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, MAXCYCLE, TOLERANCE.

Parameters: DATA, NOBSERVATIONS, RESIDUALS, FITTEDVALUES, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, CBRESIDUALS, CBFITTEDVALUES, STEPLENGTH, INITIAL.


Action with RESTRICT

You can restrict the units of a DATA variate to fit a distribution to a subset of its values.


References

McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.

Ross, G.J.S. (1990). Nonlinear Estimation. Springer-Verlag, New York.