Partial Tucker decomposition: answers

【Question title】: partial tucker decomposition
【Posted】: 2022-01-24 17:41:07
【Question description】:

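I want to apply the partial Tucker decomposition algorithm to reduce the (60000, 28, 28) MNIST image tensor dataset while preserving its features, so that I can apply another machine learning algorithm (such as an SVM) afterwards. I have this code, which reduces the second and third dimensions of the tensor: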

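import tensorly as tl
from tensorly.decomposition import partial_tucker

i = 16
j = 10
modes = [1, 2]  # decompose along the two image axes only
core, factors = partial_tucker(train_data_mnist, modes=modes, tol=10e-5, rank=[i, j])
# Project the train and test images onto the learned factors
train_data_partial_tucker = tl.tenalg.multi_mode_dot(train_data_mnist, factors,
                              modes=modes, transpose=True)
test_data_partial_tucker = tl.tenalg.multi_mode_dot(test_data_mnist, factors,
                              modes=modes, transpose=True)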

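When I use partial_tucker on the tensor, how do I find the best rank [i, j], that is, the one that gives the strongest dimensionality reduction for the images while retaining as much of the data as possible?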

【Question comments】:

Tags: python tensorflow machine-learning tensorly

【Solution 1】:

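Just as with principal component analysis, partial Tucker decomposition gives better results as the rank increases, because the optimal mean squared residual of the reconstruction becomes smaller.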
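The PCA analogy can be checked on a small matrix. The sketch below is only an illustration under stated assumptions: it uses a plain truncated SVD on synthetic random data rather than tensorly's partial_tucker on MNIST, and the helper name `truncated_svd_rmse` is hypothetical.

```python
import numpy as np

def truncated_svd_rmse(matrix, rank):
    """RMS error of the best rank-`rank` approximation of `matrix` (via SVD)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return np.sqrt(np.mean((matrix - approx) ** 2))

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 28))  # synthetic stand-in, not MNIST
# One error per rank 1..28; the list never increases as the rank grows
errors = [truncated_svd_rmse(data, r) for r in range(1, 29)]
```

The same monotone behavior is what the rank sweep in this answer measures for the two image modes.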

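In general, features (the core tensor) from which the original data can be accurately reconstructed can be used to make similar predictions: given any model, we can prepend a transformation that reconstructs the original data from the core features.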
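To make the "reconstruct from core features" step concrete, here is a minimal NumPy sketch. It is not tensorly's implementation: the factors come from a truncated HOSVD (an SVD of each unfolding) instead of the iterative algorithm behind partial_tucker, the data is synthetic, and the names `hosvd_factors`, `project` and `reconstruct` are hypothetical. As far as I understand it, `project` plays the role of the `multi_mode_dot(..., transpose=True)` call in the question.

```python
import numpy as np

def hosvd_factors(batch, ranks):
    """Leading left singular vectors of the mode-1 and mode-2 unfoldings.

    Truncated HOSVD, used here as a simple stand-in for partial_tucker.
    """
    n, h, w = batch.shape
    unfold1 = np.moveaxis(batch, 1, 0).reshape(h, -1)  # shape (h, n*w)
    unfold2 = np.moveaxis(batch, 2, 0).reshape(w, -1)  # shape (w, n*h)
    u1 = np.linalg.svd(unfold1, full_matrices=False)[0][:, :ranks[0]]
    u2 = np.linalg.svd(unfold2, full_matrices=False)[0][:, :ranks[1]]
    return u1, u2

def project(batch, u1, u2):
    """Core features: contract image rows with u1 and image columns with u2."""
    return np.einsum('nhw,hi,wj->nij', batch, u1, u2)

def reconstruct(core, u1, u2):
    """Linear map from core features back to image space."""
    return np.einsum('nij,hi,wj->nhw', core, u1, u2)

rng = np.random.default_rng(0)
images = rng.random((20, 28, 28))    # synthetic stand-in for MNIST
u1, u2 = hosvd_factors(images, (16, 10))
core = project(images, u1, u2)       # shape (20, 16, 10)
approx = reconstruct(core, u1, u2)   # shape (20, 28, 28)
```

Because `reconstruct` is a fixed linear map, a model trained on the images can in principle be fed `reconstruct(core, u1, u2)`, which is the sense in which the core features carry the same information up to the reconstruction error.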

import mxnet as mx
import numpy as np
import tensorly as tl
import matplotlib.pyplot as plt
import tensorly.decomposition

# Load data
mnist = mx.test_utils.get_mnist()
train_data = mnist['train_data'][:,0]


err = np.full([28, 28], np.nan)  # err[i, j] holds the error for rank (i, j); row/column 0 stay unused
batch = train_data[::100]  # process only 1% of the data to go faster
for i in range(1, 28):
    for j in range(1, 28):
        if np.isnan(err[i, j]):  # skip ranks already computed (useful when re-running)
            # Decompose the data
            core, factors = tl.decomposition.partial_tucker(
                batch, modes=[1, 2], tol=10e-5, rank=[i, j])
            # Reconstruct data from features
            c = tl.tenalg.multi_mode_dot(core, factors, modes=[1, 2])
            # Calculate the RMS error and save
            err[i, j] = np.sqrt(np.mean((c - batch) ** 2))

# Plot the statistics
plt.figure(figsize=(9,6))
CS = plt.contour(np.log2(err), levels=np.arange(-6, 0));
plt.clabel(CS, CS.levels, inline=True, fmt='$2^{%d}$', fontsize=16)
plt.xlabel('rank 2')
plt.ylabel('rank 1')
plt.grid()
plt.title('Reconstruction RMS error');

You usually get better results with balanced ranks, i.e. when i and j do not differ much from each other.

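The larger the error we are willing to accept, the better the compression we can achieve. We can sort the (i, j) pairs by error and plot only the positions where the error is smallest for a given feature dimension i * j, as follows: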

X = np.full([28, 28], np.nan)  # NaN cells are drawn blank by imshow
p = 28 * 28  # smallest feature count i * j accepted so far
# Visit the ranks from lowest to highest error
for e, i, j in sorted([(err[i, j], i, j) for i in range(1, 28) for j in range(1, 28)]):
    if p < i * j:
        # we can achieve this error with some better compression
        pass
    else:
        p = i * j
        X[i, j] = e
plt.imshow(X)

Any choice of (i, j) in the white region wastes resources.
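One way to turn the error table into a concrete choice is to fix an error budget and take the rank pair with the fewest features that stays within it. The sketch below is one such selection rule, not part of the original answer: the function name `smallest_rank_within` and the toy numbers are hypothetical, and it assumes an array laid out like `err` above (index 0 unused).

```python
import numpy as np

def smallest_rank_within(err, max_error):
    """Return the (i, j) with the fewest features i * j whose error is <= max_error.

    err[i, j] holds the reconstruction error for rank (i, j). Row and
    column 0 and NaN entries are ignored. Returns None if no rank fits.
    """
    best = None
    for i in range(1, err.shape[0]):
        for j in range(1, err.shape[1]):
            e = err[i, j]
            if np.isnan(e) or e > max_error:
                continue
            if best is None or i * j < best[0] * best[1]:
                best = (i, j)
    return best

# Toy error table with made-up numbers (not measured on MNIST)
toy_err = np.full((4, 4), np.nan)
toy_err[1, 1:] = [0.9, 0.7, 0.6]
toy_err[2, 1:] = [0.7, 0.4, 0.3]
toy_err[3, 1:] = [0.6, 0.3, 0.1]
choice = smallest_rank_within(toy_err, max_error=0.45)
```

With the `err` array from the loop above, the call would look the same; the budget itself has to come from the experiment, for example from the accuracy of the downstream classifier.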

【Comments】:

【Solution 2】:

If you look at the tensorly source code, you will see that the docstring of the function in question, partial_tucker, says:

"""
Partial tucker decomposition via Higher Order Orthogonal Iteration (HOI)
Decomposes 'tensor' into a Tucker decomposition exclusively along 
the provided modes.

Parameters
----------
tensor: ndarray
modes: int list
    list of the modes on which to perform the decomposition
rank: None, int or int list
    size of the core tensor,
    if int, the same rank is used for all modes
"""
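Written out element by element for the case in the question (a stack of N images, decomposed along the two image axes with rank [i, j]), the docstring describes an approximation of roughly the following form. The symbols g, u^{(1)} and u^{(2)} are notation introduced here for the core and the two factor matrices; they are not names from the tensorly source.

```latex
x_{n h w} \;\approx\; \sum_{a=1}^{i} \sum_{b=1}^{j} g_{n a b}\, u^{(1)}_{h a}\, u^{(2)}_{w b},
\qquad
\mathcal{X} \in \mathbb{R}^{N \times 28 \times 28},\quad
\mathcal{G} \in \mathbb{R}^{N \times i \times j},\quad
U^{(1)} \in \mathbb{R}^{28 \times i},\quad
U^{(2)} \in \mathbb{R}^{28 \times j}
```

The first axis (the sample index n) is left untouched, which is what "exclusively along the provided modes" refers to.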

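The purpose of this function is to give you the approximation that preserves as much of the data as possible for a given rank. I cannot tell you which rank "will give the best dimensionality reduction for the images while retaining as much of the data as possible", because the optimal trade-off between dimensionality reduction and loss of precision has no objectively "correct" answer: it depends heavily on the specific goals of your project and on the computational resources you have available to reach them.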

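If I told you to use the "best rank", it would defeat the purpose of this approximate decomposition in the first place, because the "best rank" would be the one with no "loss". That is no longer a fixed-rank approximation, and it makes the term approximation meaningless. How far to move away from this "best rank" in order to gain dimensionality reduction is not a question anyone can answer objectively for you. People can certainly offer an opinion, but that opinion would depend on much more information than I currently have from you. If you are looking for a more in-depth view of this trade-off and of which trade-off suits you best, I suggest posting a question with more details about your situation on a Stack network site that focuses on the mathematical and statistical foundations of dimensionality reduction rather than on the programming aspects that Stack Overflow focuses on, such as Stack Exchange Cross Validated or Stack Exchange Data Science.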
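The point about the lossless "best rank" can be illustrated numerically. The sketch below is an assumption-laden toy, not tensorly code: it uses synthetic data and arbitrary orthonormal 28 x 28 bases obtained from a QR factorization, and shows that at full rank (28, 28) the reconstruction is exact while the "features" are no smaller than the images.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 28, 28))  # synthetic stand-in for MNIST images

# At full rank any orthonormal 28 x 28 basis works; QR of a random matrix gives one.
u1, _ = np.linalg.qr(rng.normal(size=(28, 28)))
u2, _ = np.linalg.qr(rng.normal(size=(28, 28)))

core = np.einsum('nhw,hi,wj->nij', images, u1, u2)    # rank (28, 28) "features"
restored = np.einsum('nij,hi,wj->nhw', core, u1, u2)  # map back to image space

max_abs_error = np.max(np.abs(restored - images))     # floating-point noise only
```

Zero loss therefore comes with zero reduction, which is why the rank has to be chosen against some external criterion.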

【Comments】:

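Thanks, I will look at the specialized Stack sites. I think you are right about the best rank, but I need it anyway because I am running an experiment.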