如何在 Accord.net 中正确使用 SVD答案

【问题标题】：How to use SVD correctly in Accord.net如何在 Accord.net 中正确使用 SVD
【发布时间】：2016-09-14 20:42:35
【问题描述】：

SVD 代表奇异值分解，据说是在文本分类中进行特征缩减的流行技术。我知道原理是this link。

我一直在使用 C#，使用 Accord.Net 库，并且已经在计算 TF-IDF 时得到了一个锯齿状数组 double[][]。

我已经知道我的文档中有 4 个主题。我想用聚类数 k = 4 来测试 Kmean 方法。在使用 Kmean 之前，我想使用 SVD 进行特征缩减。当结果显示时，将近 90% 的文档被分为 1 个组，其他文档被分为 3 个其他集群。这是一个非常糟糕的结果。我已尝试重新运行多次，但结果并没有太大变化。如果我使用 PCA 而不是 SDV，一切都会按预期进行。

所以，我错了。任何知道这一点的人都可以指导我一个示例代码。非常感谢。

注意：我原来的 TF-IDF 行代表文档，列代表术语

这是我的代码：

        //to matrix because the function SVD requiring input of matrix, not jagged array
        //transpose because the TF-IDF used for SVD has rows representing terms, columns representing documents; 
        var svd = new SingularValueDecomposition(tfidf.ToMatrix().Transpose());
        double[,] U = svd.LeftSingularVectors;
        double[,] S = svd.DiagonalMatrix;
        double[,] V = svd.RightSingularVectors;

        //find the optimal cutoff y so that we retain enough singular values to make up 90% of the energy in S
        //http://infolab.stanford.edu/~ullman/mmds/ch11.pdf, page 18-20
        double energy = 0;
        for (int i = 0; i < S.GetLength(0); i++)
        {
            energy += Math.Pow(S[i, i], 2);
        }

        double percent;
        int y = S.GetLength(0);
        do
        {
            y--;
            double test = 0;
            for (int i = 0; i < y; i++)
            {
                test += Math.Pow(S[i, i], 2);
            }

            percent = test / energy;
        } while (percent >= 0.9);
        y = y + 1;

        //Uk gets all rows, y first columns of U; Sk get y first rows, y first columns of S; Vk get y first rows, all columns of V
        double[,] Uk = U.Submatrix(0, U.GetLength(0) - 1, 0, y - 1);
        double[,] Sk = S.Submatrix(0, y - 1, 0, y - 1);
        double[,] Vk = V.Submatrix(0, y - 1, 0, V.GetLength(1) - 1);

        //reduce dimension according to http://stats.stackexchange.com/questions/107533/how-to-use-svd-for-dimensionality-reduction-to-reduce-the-number-of-columns-fea
        //we tranpose again to have the rows being document, columns being term as original TF-IDF
        //ToArray because the Kmean below acquiring input of jagged array
        tfidf = Uk.Multiply(Sk).Transpose().ToArray();
        // if tfidf = Uk.Multiply(Sk).Multiply(Vk).Transpose().ToArray()
        // result still bad

        // Create a K-Means algorithm using given k and a square Euclidean distance as distance metric.
        var kmeans = new KMeans(4, Distance.SquareEuclidean) { Tolerance = 0.05 };
        int[] labels = kmeans.Compute(tfidf);

之后，我们通过一些步骤来根据标签知道哪些文档属于哪些组。

【问题讨论】：

标签： c# svd accord.net

【解决方案1】：

Accord.NET 中的 PCA 已使用 SVD 计算。有关如何手动执行 SVD 的示例，无需 PCA 类的帮助，您可以随时查看PrincipalComponentAnalysis.cs source code。

第一步是减去数据的平均值（存储在变量x中）：

 int NumberOfInputs = x.Columns();
 this.Means = x.Mean(dimension: 0);
 z = x.Subtract(Means, dimension: 0);

现在，您可以选择将数据除以标准差，从而有效地将数据转换为 z 分数。此步骤严格来说是可选的，但当您的数据表示以数量级差异很大的单位收集的变量时（即，一列表示以千米为单位的高度，另一列以厘米为单位）时，这可能是有意义的。

this.StandardDeviations = x.StandardDeviation(Means);
z = z.Divide(StandardDeviations, dimension: 0);

现在，“x”的主成分是 Cov(x) 的特征向量。因此，如果我们计算 'z' 的 SVD（x 标准化），矩阵 V 的列（SVD 右侧）将是x 的主要成分。有了这个，我们现在要做的就是执行矩阵 z 的奇异值分解（SVD）：

var svd = new JaggedSingularValueDecomposition(matrix,
    computeLeftSingularVectors: false,
    computeRightSingularVectors: true,
    autoTranspose: true);

var singularValues = svd.Diagonal;
var eigenvalues = SingularValues.Pow(2);
var eigenvalues.Divide(x.Rows() - 1);
var componentVectors = svd.RightSingularVectors.Transpose();

如果你想进行白化，你也可以将你的向量除以奇异值：

componentVectors = componentVectors.Divide(singularValues, dimension: 1);

现在，如果您想将数据投影到最多 90% 的方差，计算特征值的累积和，类似于您所做的：

// Calculate proportions
var componentProportions = eigenvalues.Abs().Divide(eigenValues.Abs().Sum());

// Calculate cumulative proportions
var componentCumulative = componentProportions.CumulativeSum();

现在，通过查看累积比例变得大于您想要的方差比例来确定您需要的维度数。知道这个数字后，从特征向量矩阵中只选择那些特征向量：

int numberOfVectors = // number of vectors that you need

for (int i = 0; i < rows; i++)
    for (int j = 0; j < numberOfVectors; j++)
        for (int k = 0; k < componentVectors[j].Length; k++)
            result[i][j] += z[i][k] * componentVectors[j][k];

在上面的示例中，我正在转换矩阵 z，它已经是平均居中的并且可以选择标准化。在转换另一组数据之前，不要忘记应用与原始矩阵相同的转换。

最后，请记住，手动执行上述所有操作是完全可选的。您真的应该免费使用 PrincipalComponentAnalysis 类为您完成所有这些繁重的工作！

【讨论】：