
SVD stands for Singular Value Decomposition and is said to be a popular technique for feature reduction in text classification. I understand the principle as explained in this link.
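For reference, SVD factors a matrix as $A = U S V^\top$, and feature reduction keeps only the $k$ largest singular values, giving the rank-$k$ approximation $A_k = U_k S_k V_k^\top$.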

I am using C# with the Accord.NET library, and I already have a jagged array double[][] from computing TF-IDF.

I already know that there are 4 topics in my documents, so I want to test the K-means method with the number of clusters k = 4. Before running K-means, I want to use SVD for feature reduction. However, in the results, nearly 90% of the documents end up in one cluster, while the rest are spread across the other 3 clusters. This is a very bad result. I have rerun it a number of times, but the results do not change much. If I use PCA instead of SVD, everything works as expected.

So, where am I wrong? Can anyone who knows this guide me with some sample code? Thank you a lot.


Note: my original TF-IDF matrix has rows representing documents and columns representing terms.

Here is my code:

        // convert to a matrix because SingularValueDecomposition requires a matrix, not a jagged array
        // transpose because the TF-IDF fed to the SVD must have rows representing terms and columns representing documents
        var svd = new SingularValueDecomposition(tfidf.ToMatrix().Transpose());
        double[,] U = svd.LeftSingularVectors;
        double[,] S = svd.DiagonalMatrix;
        double[,] V = svd.RightSingularVectors;

        // find the smallest cutoff y such that the first y singular values retain at least 90% of the energy in S
        //http://infolab.stanford.edu/~ullman/mmds/ch11.pdf, page 18-20
        double energy = 0;
        for (int i = 0; i < S.GetLength(0); i++)
        {
            energy += Math.Pow(S[i, i], 2);
        }

        double percent;
        int y = S.GetLength(0);
        do
        {
            y--;
            double test = 0;
            for (int i = 0; i < y; i++)
            {
                test += Math.Pow(S[i, i], 2);
            }

            percent = test / energy;
        } while (percent >= 0.9);
        y = y + 1;

        // Uk takes all rows and the first y columns of U; Sk takes the first y rows and columns of S; Vk takes the first y rows and all columns of V
        double[,] Uk = U.Submatrix(0, U.GetLength(0) - 1, 0, y - 1);
        double[,] Sk = S.Submatrix(0, y - 1, 0, y - 1);
        double[,] Vk = V.Submatrix(0, y - 1, 0, V.GetLength(1) - 1);

        // reduce dimensions according to http://stats.stackexchange.com/questions/107533/how-to-use-svd-for-dimensionality-reduction-to-reduce-the-number-of-columns-fea
        // transpose again so that rows represent documents and columns represent terms, as in the original TF-IDF
        // ToArray because the K-means below requires a jagged array
        tfidf = Uk.Multiply(Sk).Transpose().ToArray();
        // using tfidf = Uk.Multiply(Sk).Multiply(Vk).Transpose().ToArray() instead
        // still gives bad results

        // Create a K-Means algorithm using given k and a square Euclidean distance as distance metric.
        var kmeans = new KMeans(4, Distance.SquareEuclidean) { Tolerance = 0.05 };
        int[] labels = kmeans.Compute(tfidf);

After that, we do a few more steps to work out which documents belong to which cluster according to the labels.
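For example, a minimal sketch of that step (assuming labels[i] is the cluster assigned to document i; requires using System.Linq):

    // group document indices by their cluster label
    var groups = labels
        .Select((cluster, doc) => new { cluster, doc })
        .GroupBy(p => p.cluster, p => p.doc);

    foreach (var g in groups)
        Console.WriteLine("Cluster " + g.Key + ": documents " + string.Join(", ", g));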

1 Answer


The PCA in Accord.NET is already computed using the SVD. For an example of how to perform the SVD manually, without the help of the PCA class, you can always look at the PrincipalComponentAnalysis.cs source code.

The first step is to subtract the mean of your data (stored in variable x):

int NumberOfInputs = x.Columns();
this.Means = x.Mean(dimension: 0);
z = x.Subtract(Means, dimension: 0);

Now, you can optionally divide your data by their standard deviations, effectively transforming your data to z-scores. This step is strictly optional, but might make sense when your data represents variables collected in units of widely varying orders of magnitude (e.g., one column represents values in kilometers, another in centimeters).

this.StandardDeviations = x.StandardDeviation(Means);
z = z.Divide(StandardDeviations, dimension: 0);

Now, the principal components of x are the eigenvectors of Cov(x). Thus, if we calculate the SVD of z (which is x standardized), the columns of the matrix V (the right side of the SVD) will be the principal components of x. So all we have to do now is perform the Singular Value Decomposition (SVD) of the matrix z:

var svd = new JaggedSingularValueDecomposition(z,
    computeLeftSingularVectors: false,
    computeRightSingularVectors: true,
    autoTranspose: true);

var singularValues = svd.Diagonal;
var eigenvalues = singularValues.Pow(2);
eigenvalues = eigenvalues.Divide(x.Rows() - 1);
var componentVectors = svd.RightSingularVectors.Transpose();
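For context on why the code squares the singular values and divides by $n - 1$: because $z$ is centered, $\operatorname{Cov}(x) = \frac{z^\top z}{n - 1} = V \frac{S^\top S}{n - 1} V^\top$, so the columns of $V$ are the eigenvectors of the covariance matrix, with eigenvalues $\lambda_i = \sigma_i^2 / (n - 1)$.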

If you would like to perform whitening, you can also divide your vectors by the singular values:

componentVectors = componentVectors.Divide(singularValues, dimension: 1);

Now, if you would like to project your data onto only enough components to explain up to 90% of the variance, compute the cumulative sum of the eigenvalues, similarly to what you did before:

// Calculate proportions
var componentProportions = eigenvalues.Abs().Divide(eigenvalues.Abs().Sum());

// Calculate cumulative proportions
var componentCumulative = componentProportions.CumulativeSum();

Now, determine the number of dimensions that you need by looking where the cumulative proportions become larger than the proportion of variance that you want. After you know this number, select only those eigenvectors from the eigenvector matrix:

// pick the smallest number of components whose cumulative proportion
// reaches the variance you want (here 90%, as in the question)
int numberOfVectors = Array.FindIndex(componentCumulative, p => p >= 0.9) + 1;

int rows = z.Rows();
double[][] result = Jagged.Zeros(rows, numberOfVectors);

for (int i = 0; i < rows; i++)
    for (int j = 0; j < numberOfVectors; j++)
        for (int k = 0; k < componentVectors[j].Length; k++)
            result[i][j] += z[i][k] * componentVectors[j][k];

In the example above I am transforming the matrix z, which was already mean centered and optionally standardized. Don't forget to apply the same transformations that you applied to the original matrix before you transform another set of data.
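For example, a minimal sketch of transforming a second, hypothetical matrix xNew with the parameters learned above, reusing the same projection loop as before:

// apply the stored transformations to the new data
double[][] zNew = xNew.Subtract(Means, dimension: 0);
zNew = zNew.Divide(StandardDeviations, dimension: 0); // only if you standardized above

// project onto the same component vectors selected earlier
double[][] projected = Jagged.Zeros(zNew.Rows(), numberOfVectors);
for (int i = 0; i < zNew.Rows(); i++)
    for (int j = 0; j < numberOfVectors; j++)
        for (int k = 0; k < componentVectors[j].Length; k++)
            projected[i][j] += zNew[i][k] * componentVectors[j][k];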

Finally, please keep in mind that doing all of the above manually is completely optional. You should really use the PrincipalComponentAnalysis class to do all this heavy lifting for you, for free!
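For example, a minimal sketch in the same classic Accord.NET API style the question already uses (KMeans.Compute; newer releases expose Learn/Decide instead), with the 0.9 threshold mirroring the 90% energy cutoff above:

// assumes tfidf is the original documents-by-terms jagged array
var pca = new PrincipalComponentAnalysis(tfidf.ToMatrix(), AnalysisMethod.Center);
pca.Compute();

// keep enough components to explain 90% of the variance
int y = 1;
while (pca.CumulativeProportions[y - 1] < 0.9)
    y++;

double[][] reduced = pca.Transform(tfidf.ToMatrix(), y).ToArray();

var kmeans = new KMeans(4, Distance.SquareEuclidean) { Tolerance = 0.05 };
int[] labels = kmeans.Compute(reduced);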