
I'm trying to do feature selection using PCA in Java without a ML framework, just using the Apache math matrix library.

The input test data is a 2D array: 4 feature columns x 100 row instances. Roughly, I've gone through the following steps:

  1. Load the data, normalise, store in the RealMatrix class and calculate the covariance matrix:

    PCAResultSet pcaResultSet = new PCAResultSet();
    double[][] data = dataToDoubleArray();
    
    if (this.normalize)
        data = StatMath.normalize(data);
    
    RealMatrix origData = MatrixUtils.createRealMatrix(data);
    Covariance covariance = new Covariance(origData);
    
    /* The eigenvectors of the covariance matrix represent the 
     * principal components (the directions of maximum variance) 
     */
    RealMatrix covarianceMatrix = covariance.getCovarianceMatrix();
    
  2. Perform an eigendecomposition and get the eigenvectors & eigenvalues

    /* Each of those eigenvectors is associated with an eigenvalue which can be 
     * interpreted as the “length” or “magnitude” of the corresponding eigenvector. 
     * If some eigenvalues have a significantly larger magnitude than others, 
     * then the reduction of the dataset via PCA onto a smaller dimensional subspace 
     * by dropping the “less informative” eigenpairs is reasonable.
     * 
     *  Eigenvectors represent the relative basis (axis) for the data
     *  
     *  Computes new variables from the PCA analysis
     */
    EigenDecomposition decomp = new EigenDecomposition(covarianceMatrix);
    
    /* The numbers on the diagonal of the diagonalized covariance matrix 
     * are called eigenvalues of the covariance matrix. Large eigenvalues 
     * correspond to large variances. 
     */
    double[] eigenvalues = decomp.getRealEigenvalues();
    
    /* The directions of the new rotated axes are called the 
     * eigenvectors of the covariance matrix.
     * 
     * Columns are eigenvectors 
     */
    RealMatrix eigenvectors = decomp.getV(); 
    
    pcaResultSet.setEigenvectors(eigenvectors);
    pcaResultSet.setEigenvalues(eigenvalues);
    
  3. Select the first n eigenvectors (Commons Math orders them by descending eigenvalue by default), then project the data by multiplying the transposed n x m eigenvector matrix with the transposed original data

    /* Keep the first n columns, corresponding to the
     * largest eigenvalues
     */
    int rows = eigenvectors.getRowDimension();
    int cols = 1;
    
    RealMatrix evecTran = eigenvectors.getSubMatrix(0, rows - 1, 0, cols - 1).transpose();
    RealMatrix origTran = origData.transpose();
    
    /* The projected data onto the lower-dimension hyperplane */
    RealMatrix dataProj = evecTran.multiply(origTran).transpose();
    
  4. Finally, calculate the explained variance of each principal component

    /* The variance explained ratio of an eigenvalue λ_j is 
     * simply the fraction of an eigenvalue λ_j and the total 
     * sum of the eigenvalues 
     */
    double[] explainedVariance = new double[eigenvalues.length];
    double sum = StatMath.sum(eigenvalues);
    
    for (int i = 0; i < eigenvalues.length; i++)
        explainedVariance[i] = ((eigenvalues[i] / sum) * 100);
    
    pcaResultSet.setExplainedVariance(explainedVariance);
    pcaResultSet.print();
    
    Utils.print("PCA", "Projected Data:", 0, true);
    printMatrix(dataProj);
    
    return pcaResultSet;
    

Using this code, PC1 explains roughly 90% of the variance, but how do I use this result to perform feature selection and figure out which features to drop from the original data?

A framework like Weka will rank the features to show which combination from the original set produces the best result. I'm trying to do the same, but I'm unsure how the eigenvectors/decomposition scores map back to the original features.


1 Answer


From your question, what I understood is that you want to do feature selection (or elimination) using PCA.

One of the ways of doing it is by taking the reconstruction error.

To calculate the reconstruction error, you need to do an inverse PCA, i.e. map the principal-component scores back into the original feature space to recover an approximation of the 2D array. Let's call this reconstrData, and your original array originalData.

Now find the error matrix (let's call it errorMat), which is simply reconstrData - originalData.

Now, in errorMat, compute the MAE (mean absolute error) down each column. The top n columns with the lowest MAE can be selected, or the top m columns with the highest MAE can be rejected.

Sorry, I don't know Java, so I couldn't post the code. But I can help you conceptually, so let me know if you face any difficulty implementing the above logic.
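A rough sketch of this logic in Java with Apache Commons Math (the library the question already uses). This is illustrative only: `columnMae` is a hypothetical helper, the toy matrix stands in for the question's 100 x 4 data, and the data is mean-centred first so the reconstruction lands in the same coordinate space as the input:

```java
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

public class ReconstructionError {

    /* Column-wise mean absolute error between two equally sized matrices */
    static double[] columnMae(RealMatrix a, RealMatrix b) {
        int rows = a.getRowDimension();
        int cols = a.getColumnDimension();
        double[] mae = new double[cols];
        for (int j = 0; j < cols; j++) {
            double sum = 0.0;
            for (int i = 0; i < rows; i++)
                sum += Math.abs(a.getEntry(i, j) - b.getEntry(i, j));
            mae[j] = sum / rows;
        }
        return mae;
    }

    public static void main(String[] args) {
        /* Toy data standing in for the question's 100 x 4 matrix */
        RealMatrix x = MatrixUtils.createRealMatrix(new double[][] {
            {2.5, 2.4, 0.5}, {0.5, 0.7, 2.1}, {2.2, 2.9, 0.8},
            {1.9, 2.2, 1.1}, {3.1, 3.0, 0.4}, {2.3, 2.7, 0.9}
        });
        int n = x.getRowDimension();
        int m = x.getColumnDimension();

        /* Mean-centre each column so the reconstruction comes out
         * in the same coordinates as the data */
        RealMatrix centred = x.copy();
        for (int j = 0; j < m; j++) {
            double mean = 0.0;
            for (int i = 0; i < n; i++) mean += x.getEntry(i, j);
            mean /= n;
            for (int i = 0; i < n; i++)
                centred.setEntry(i, j, x.getEntry(i, j) - mean);
        }

        /* PCA as in the question: covariance -> eigendecomposition.
         * Commons Math returns eigenpairs in descending eigenvalue
         * order, with the eigenvectors as the columns of V. */
        RealMatrix cov = new Covariance(centred).getCovarianceMatrix();
        EigenDecomposition decomp = new EigenDecomposition(cov);

        /* Keep the first k components, project, then reconstruct:
         * scores = Xc * W, reconstrData = scores * W^T */
        int k = 1;
        RealMatrix w = decomp.getV().getSubMatrix(0, m - 1, 0, k - 1);
        RealMatrix reconstr = centred.multiply(w).multiply(w.transpose());

        /* errorMat = reconstrData - originalData, summarised per
         * column: low MAE = the retained components capture that
         * feature well; high MAE = candidate for rejection */
        double[] mae = columnMae(centred, reconstr);
        for (int j = 0; j < m; j++)
            System.out.printf("feature %d: reconstruction MAE = %.4f%n",
                    j, mae[j]);
    }
}
```

With k = 1 as in the question's step 3, the features that the first principal component reconstructs poorly show the largest MAE and are the ones to consider dropping.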