Map scikit scaled data back to ID

【Question title】: Map scikit scaled data back to ID
【Posted】: 2017-04-23 20:13:09
【Question】:

I have a pandas.DataFrame that looks like this:

In [48]: df
Out[48]: 
        AMID         A         B         C
0  AMID-1000  0.149176  0.768200  0.689369
1  AMID-1001  0.169934  0.607390  0.471788
2  AMID-1002  0.632052  0.806657  0.994664
3  AMID-1003  0.003798  0.382427  0.894856
4  AMID-1004  0.182947  0.712373  0.870068
5  AMID-1005  0.385039  0.691643  0.546960
6  AMID-1006  0.971885  0.169759  0.804370
7  AMID-1007  0.443199  0.686212  0.377556
8  AMID-1008  0.149402  0.981370  0.588750
9  AMID-1009  0.214107  0.264285  0.463403

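The 'AMID' column holds the data point IDs, and each of the remaining columns is a feature of each data point.

I want to use this dataset with an algorithm that needs scaled data, so that every column has mean == 0 and std == 1. I am using sklearn.preprocessing.StandardScaler for this. However, in order to scale, I first have to strip the non-numeric 'AMID' column from the dataset.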

In [61]: from sklearn import preprocessing

In [62]: data = df[[_ for _ in df.columns.values.tolist() if _ not in ['AMID']]]

In [64]: scaler = preprocessing.StandardScaler().fit(data)

In [65]: data_scaled = scaler.transform(data)

In [66]: data_scaled
Out[66]: 
array([[ -6.60180258e-01,   6.63739262e-01,   9.55187160e-02],
       [ -5.84458777e-01,   1.47534202e-03,  -9.87448200e-01],
       [  1.10128130e+00,   8.22117198e-01,   1.61505880e+00],
       [ -1.19049913e+00,  -9.24989864e-01,   1.11828380e+00],
       [ -5.36991596e-01,   4.33827828e-01,   9.94906952e-01],
       [  2.00212895e-01,   3.48454485e-01,  -6.13293011e-01],
       [  2.34094244e+00,  -1.80081691e+00,   6.67913149e-01],
       [  4.12372276e-01,   3.26087187e-01,  -1.45646800e+00],
       [ -6.59357873e-01,   1.54163661e+00,  -4.05292050e-01],
       [ -4.23321269e-01,  -1.41153114e+00,  -1.02918017e+00]])

In [67]: data_scaled.mean(axis=0)
Out[67]: array([ -8.32667268e-17,  -4.44089210e-17,  -2.88657986e-16])

In [68]: data_scaled.std(axis=0)
Out[68]: array([ 1.,  1.,  1.])

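So far, things look good!

Now I can go ahead and feed this data to my model, then test it with the test data (also scaled with the same fitted scaler). However, I need to be able to see exactly which prediction the classifier gives for each AMID. So I think I should either map each data point's scaled data back to its AMID and then call the classifier's .predict() method on each data point separately, or somehow map the results of .predict() back to the list of AMIDs.

My first idea was to assign the new values to the original data frame, like this: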

In [73]: df_copy['A'] = data_scaled[:,0:1]

In [74]: df_copy
Out[74]: 
        AMID         A         B         C
0  AMID-1000 -0.660180  0.768200  0.689369
1  AMID-1001 -0.584459  0.607390  0.471788
2  AMID-1002  1.101281  0.806657  0.994664
3  AMID-1003 -1.190499  0.382427  0.894856
4  AMID-1004 -0.536992  0.712373  0.870068
5  AMID-1005  0.200213  0.691643  0.546960
6  AMID-1006  2.340942  0.169759  0.804370
7  AMID-1007  0.412372  0.686212  0.377556
8  AMID-1008 -0.659358  0.981370  0.588750
9  AMID-1009 -0.423321  0.264285  0.463403

But I'm not sure whether this distorts the association between the original 'AMID' and each column's scaled values.

Is there a better way to do this?

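【Comments】:

"However, what I need to do is map the scaled data back to the 'AMID', so that I can record the classifier's .predict() output for each AMID value separately." I don't understand. Can you rephrase that?
I have the ground truth for each one of the data points, so I can actually evaluate the classifier's performance just fine, even without mapping back to AMID. But I also need to know which specific automobile (AMID) belongs to which class. So I was thinking this would require iterating over the dataset and feeding each data point to the classifier separately. In short, I want to find out what the classifier predicts for each AMID. Thanks!
"I have the ground truth for each one of the datapoints." If you already have the classes, then what are you trying to get? --> "But I also need to know which specific automobile (AMID) belongs to which class"? I don't get it.
@MMF I'm trying to find the best way to represent my data, so I evaluate the classifier's performance against several different flavors of my dataset and see which ones give the best results or let the classifier solve the problem more efficiently. For this binary problem I have the ground truth, so I can compute the performance for each dataset flavor.
@MMF Q: "But I also need to know which specific automobile (AMID) belongs to which class." A: What I mean is that I need to know what the algorithm predicted for each data point.

Tags: python python-3.x pandas machine-learning scikit-learn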

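【Solution 1】:

IIUC, I would simply set AMID as the index (so that it doesn't get in the way, and it makes things easier afterwards) and then recreate a data frame on the fly, like this: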

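import pandas as pd
from sklearn import preprocessing

# Use AMID as the index so it stays attached to each row but is not scaled.
df.set_index('AMID', inplace=True)
scaler = preprocessing.StandardScaler()
# fit_transform returns a NumPy array; rebuild the frame with the same index and columns.
df = pd.DataFrame(scaler.fit_transform(df), index=df.index, columns=df.columns)
df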

                  A         B         C
AMID                                   
AMID-1000 -0.660181  0.663739  0.095517
AMID-1001 -0.584459  0.001476 -0.987447
AMID-1002  1.101281  0.822116  1.615059
AMID-1003 -1.190499 -0.924988  1.118286
AMID-1004 -0.536990  0.433827  0.994909
AMID-1005  0.200213  0.348455 -0.613294
AMID-1006  2.340943 -1.800818  0.667911
AMID-1007  0.412372  0.326088 -1.456467
AMID-1008 -0.659357  1.541636 -0.405293
AMID-1009 -0.423322 -1.411532 -1.029181
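On the question's worry about distorting the association between 'AMID' and the scaled values: StandardScaler works column by column and returns rows in the order they went in, so reusing the original index should keep each row attached to its AMID. A minimal sketch to check this, assuming a small hypothetical frame built from the first three rows of the question's data (the variable names `scaled`, `restored` and `round_trip_ok` are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's frame (first three rows only).
df = pd.DataFrame(
    {
        "AMID": ["AMID-1000", "AMID-1001", "AMID-1002"],
        "A": [0.149176, 0.169934, 0.632052],
        "B": [0.768200, 0.607390, 0.806657],
        "C": [0.689369, 0.471788, 0.994664],
    }
).set_index("AMID")

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), index=df.index, columns=df.columns)

# Undo the scaling. If transform had reordered rows, the restored values
# would not match df row for row.
restored = pd.DataFrame(
    scaler.inverse_transform(scaled), index=scaled.index, columns=scaled.columns
)
round_trip_ok = np.allclose(restored.values, df.values)
print(round_trip_ok)
```

The round trip through `inverse_transform` is only a sanity check; it is not needed in the actual pipeline.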

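If you want AMID as a column rather than as the index, you can use reset_index(), but IMHO it is better as the index (I assume you will want to fit a model on that data afterwards...)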
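With AMID as the index, getting one prediction per AMID does not require looping over the rows: .predict() returns one value per input row in input order, so the result can be wrapped in a Series that reuses the index. The sketch below is only an illustration under assumptions: the feature values, the labels `y` and the choice of LogisticRegression are hypothetical, since the question does not say which classifier or labels are used.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical features and binary labels; only the AMID naming follows the question.
X = pd.DataFrame(
    {"A": [0.1, 0.2, 0.8, 0.9], "B": [0.7, 0.6, 0.2, 0.1]},
    index=pd.Index(
        ["AMID-1000", "AMID-1001", "AMID-1002", "AMID-1003"], name="AMID"
    ),
)
y = pd.Series([0, 0, 1, 1], index=X.index, name="label")

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), index=X.index, columns=X.columns)

# Stand-in classifier; any estimator with fit/predict would do here.
clf = LogisticRegression().fit(X_scaled, y)

# predict() returns one value per input row, in input order,
# so reusing the index labels each prediction with its AMID.
predictions = pd.Series(clf.predict(X_scaled), index=X_scaled.index, name="predicted")
print(predictions)
```

For held-out test data the same idea applies, using scaler.transform() (not fit_transform()) on the test frame and reusing the test frame's own index for the predictions.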

【Comments】:

You were faster than me! Well played. I had suggested it in a comment!
Oops, just saw that you had suggested it too :)