Construction of Composite Indices: Alternatives to the Principal Components Analysis

Given X(n,m), data on n cases (observations) and m variables, we often want to obtain Index(i) = w(1) x(i,1) + w(2) x(i,2) + ... + w(m) x(i,m); i=1,2,...,n. Sometimes the weights w(j), j=1,2,...,m, are obtained extraneously (from expert opinion or from some other data, say Y). At other times, however, we want to derive the weights mathematically from X(n,m) itself. Principal components analysis is an often-used method of deriving such weights from X(n,m). It maximizes the sum of squared correlation coefficients of the Index with x(1), x(2), ..., x(m).
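As a minimal sketch of the definitions above, the following Python fragment computes the Index for a given weight vector and then its correlation with each variable. The data X and the weights w are made-up illustrative values, and the function names are hypothetical; none of them come from the FORTRAN program.

```python
import math

def pearson(a, b):
    """Product moment correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def composite_index(X, w):
    """Index(i) = w(1) x(i,1) + ... + w(m) x(i,m) for every case i."""
    return [sum(wj * xij for wj, xij in zip(w, row)) for row in X]

def index_correlations(X, w):
    """Correlation of the Index with each of the m variables (columns of X)."""
    index = composite_index(X, w)
    m = len(X[0])
    return [pearson(index, [row[j] for row in X]) for j in range(m)]

# Illustrative data: n = 5 cases, m = 3 variables (assumed values).
X = [
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 3.0],
    [3.0, 4.0, 4.0],
    [4.0, 3.0, 1.0],
    [5.0, 5.0, 2.0],
]
w = [0.5, 0.3, 0.2]  # assumed weights, e.g. from expert opinion

index = composite_index(X, w)
r = index_correlations(X, w)
sum_sq = sum(v * v for v in r)  # the quantity that principal components analysis maximizes over w
print(index, r, sum_sq)
```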

It has been a general experience that principal components analysis tends to ignore (or undervalue) variables that are not highly correlated with their fellow variables. It has a clear preference for the highly correlated subset of variables. In this sense the index obtained by principal components analysis is elitist.

It is possible to obtain weights from X(n,m) such that (i) the sum of absolute coefficients of correlation between the Index and the variables (x(1), x(2), etc.) is maximized, (ii) an entropy-like function of the correlation coefficients between the Index and the variables is maximized, or (iii) the minimal correlation coefficient between the Index and the variables is maximized. Either the product moment correlation coefficient or Bradley's absolute correlation coefficient may be used in constructing such indices. For details, see the paper "A Comparative Study of Various Inclusive Indices and the Index Constructed by the Principal Components Analysis".
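The difference between these criteria can be illustrated with a small Python sketch. It takes a vector of correlations between an index and the variables and evaluates the absolute, Euclidean (principal components) and maximin criteria. The two correlation vectors are hypothetical numbers chosen for illustration, the use of absolute values in the maximin criterion is an assumption, and the entropy-like function is left out because its exact form is given only in the paper cited above.

```python
def objective(r, norm):
    """Criterion value for a vector r of index-variable correlations.

    norm = 1: sum of absolute correlations
    norm = 2: sum of squared correlations (the principal components criterion)
    norm = 3: smallest absolute correlation (maximin; absolute value assumed)
    """
    abs_r = [abs(v) for v in r]
    if norm == 1:
        return sum(abs_r)
    if norm == 2:
        return sum(v * v for v in r)
    if norm == 3:
        return min(abs_r)
    raise ValueError("norm must be 1, 2 or 3")

# Two hypothetical candidate indices, described by their correlations with
# three variables. The first favours two variables and nearly ignores the
# third; the second represents all three more evenly.
r_elitist = [0.9, 0.8, 0.1]
r_inclusive = [0.7, 0.7, 0.5]

for norm in (1, 2, 3):
    print(norm, objective(r_elitist, norm), objective(r_inclusive, norm))
```

With these numbers the Euclidean criterion ranks the first candidate higher, while the absolute and maximin criteria rank the second higher, which is the sense in which the alternative indices are more inclusive.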

We obtain such indices by maximizing the said quantities directly with the Differential Evolution method of global optimization. The program (FORTRAN code) that we use may be downloaded from here.

To use the program, a suitable FORTRAN compiler is needed. A free Fortran compiler may be obtained from here or, alternatively, from here. The program may be copied and pasted directly into the editor of the compiler.

The user must set the parameters NOB and MVAR to match the problem at hand. These parameters are to be changed in two statements: (1) in the main program at line 15 and (2) in the subroutine CORD at line 474.

Before running the program, one must store X(n,m) in a text file (filename.txt). The program reads the data from this file. When the program runs, it first asks for four inputs: (i) a four-digit (non-zero) integer to serve as the seed for generating random numbers; (ii) norm: 1 for the absolute, 2 for the Euclidean and 3 for the maximin index; (iii) entropy: 0 for norm maximization and 1 for entropy maximization; (iv) ncor: 0 for product moment correlation and 1 for absolute correlation (Bradley). Note that norm=2 (with entropy=0 and ncor=0) gives the same index as principal components analysis.

Next, the program asks for the name of the input file, that is, the file in which X(n,m) has been stored. NOB is the same as n and MVAR is the same as m. It then asks for the name of the output file in which the results will be stored.

Sometimes researchers unitize the variables. Unitization may be accomplished in many ways; one procedure is to subtract the minimal value of a variable from its value for each case and then divide by the range of the variable, that is, [x(i,j)-min(x(.,j))]/[max(x(.,j))-min(x(.,j))]. Unless min(x) and max(x) are free from error, this practice is discouraged. A change of scale of this kind has no effect on the correlation coefficients, so it brings no benefit while running the risk of being affected by error in min(x) or max(x). The program asks whether unitization is to be done. Enter 1 for 'yes' or any other number for 'no'. The suggested option is 'no', that is, enter any number other than 1.
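The claim that unitization leaves correlation coefficients unchanged can be checked with a short Python sketch for the product moment correlation. The function names and the sample values are illustrative assumptions and are not taken from the FORTRAN program.

```python
import math

def pearson(a, b):
    """Product moment correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def unitize(column):
    """Rescale a variable to [0, 1]: (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("a constant variable cannot be unitized")
    return [(v - lo) / (hi - lo) for v in column]

x = [3.0, 7.0, 5.0, 11.0]  # assumed values of one variable
y = [1.0, 4.0, 2.0, 3.0]   # assumed values of another variable

u = unitize(x)
print(u)                               # rescaled values of x
print(pearson(x, y), pearson(u, y))    # the two correlations agree up to rounding
```

Because unitization is a linear rescaling with a positive factor, the two correlations agree, so any error in min(x) or max(x) is introduced without any gain.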

Further, the program asks for another seed for generating random numbers; any four-digit (non-zero) integer would do. Next, one has to input m (mvar), the number of variables in the data, and then the population size and the number of iterations to be performed. The population size should not be less than ten times the number of variables in the problem (population size >= 10 x mvar). Iterations could be 10000 or more. Two more inputs are then required: PCROSS, the probability of crossover, which should be about 0.9, and FACT, which may be around 0.5. Finally, the program requires yet another seed for random number generation, again a four-digit (non-zero) integer. The program then proceeds to obtain the indices and stores the results in the output file specified by the user. The output file may be opened with any text editor or MS Word.
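To show how the population size, the number of iterations, PCROSS and FACT enter the search, here is a minimal Python sketch of a textbook Differential Evolution scheme (rand/1/bin) that maximizes one of the criteria over the weights. It is not a transcription of the FORTRAN program, which may use a different variant. FACT is assumed here to be the scale factor applied to the difference of two population members, the data are made up, and the iteration count is kept far below the suggested 10000 so that the example runs quickly.

```python
import math
import random

def pearson(a, b):
    """Product moment correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return cov / den if den > 0 else 0.0

def fitness(w, X, norm):
    """Criterion value of the index built with weights w (1, 2 or 3 = maximin)."""
    index = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    r = [abs(pearson(index, [row[j] for row in X])) for j in range(len(w))]
    if norm == 1:
        return sum(r)
    if norm == 2:
        return sum(v * v for v in r)
    return min(r)

def differential_evolution(X, norm, pop_size, iters, pcross=0.9, fact=0.5, seed=1234):
    """Maximize fitness over the weight vector with DE/rand/1/bin."""
    rng = random.Random(seed)
    m = len(X[0])
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(pop_size)]
    fit = [fitness(w, X, norm) for w in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Three distinct members other than the current one.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(m)  # at least one coordinate always crosses over
            trial = [
                pop[a][j] + fact * (pop[b][j] - pop[c][j])
                if (rng.random() < pcross or j == forced)
                else pop[i][j]
                for j in range(m)
            ]
            f = fitness(trial, X, norm)
            if f >= fit[i]:  # keep the trial only if it is no worse
                pop[i], fit[i] = trial, f
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

# Illustrative data: n = 6 cases, m = 3 variables (assumed values).
X = [
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 3.0],
    [3.0, 4.0, 6.0],
    [4.0, 3.0, 1.0],
    [5.0, 6.0, 2.0],
    [6.0, 5.0, 4.0],
]
# Population size 30 = 10 x mvar, following the rule of thumb above.
best_w, best_f = differential_evolution(X, norm=2, pop_size=30, iters=300)
print(best_w, best_f)
```

Since the criteria depend on the weights only through correlations, any positive multiple of best_w gives the same value, so the weights may be rescaled afterwards if a particular normalization is wanted.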

The index vector and the vector of correlation coefficients of the index with the variables are indeterminate as to sign: if the entire index is multiplied by -1, the entire vector of correlations must also be multiplied by -1. This is because the method uses absolute or squared correlations, which disregard sign. For consistency, the weights assigned to the variables are also to be multiplied by -1. This indeterminacy is not specific to the method used here; it arises also when traditional principal components analysis is used.
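The sign indeterminacy can be seen in a small Python sketch: negating the weights negates the index and every correlation, while the absolute and squared correlations stay the same. The data and weights are illustrative assumptions.

```python
import math

def pearson(a, b):
    """Product moment correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def index_and_correlations(X, w):
    """Return the index and its correlation with each column of X."""
    index = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    r = [pearson(index, [row[j] for row in X]) for j in range(len(w))]
    return index, r

# Assumed data (4 cases, 2 variables) and assumed weights.
X = [[1.0, 4.0], [2.0, 3.0], [3.0, 5.0], [4.0, 1.0]]
w = [0.6, -0.4]
w_flipped = [-v for v in w]

index, r = index_and_correlations(X, w)
index_flipped, r_flipped = index_and_correlations(X, w_flipped)

print(r, r_flipped)                        # same magnitudes, opposite signs
print(sum(abs(v) for v in r), sum(abs(v) for v in r_flipped))  # identical
```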