Issue in handling NaN for distance calculation?

【Question title】: Issue in handling NaN for distance calculation?
【Posted】: 2018-10-01 17:22:30
【Question description】:

I have a DataFrame like the following (simplified), with the point names as the index column:

import numpy as np
import pandas as pd
# np.nan is used throughout; the np.NaN alias was removed in NumPy 2.0
a = {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]}
df = pd.DataFrame(a, index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
df

Because it contains NaN, I want columns that actually hold numbers to be treated as numeric, so I do the following:

for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s

After converting the columns to numeric types, I want to compute the distance matrix as follows:

def distmetric(x,y):
    numeric5=x.select_dtypes(include=["number"])
    others5=x.select_dtypes(exclude=["number"])
    numeric6=y.select_dtypes(include=["number"])
    others6=y.select_dtypes(exclude=["number"])
    numnp5=numeric5.values
    catnp5=others5.values
    numnp6=numeric6.values
    catnp6=others6.values
    result3=np.around((np.repeat(numnp5, len(numnp6),axis=0) - np.tile(numnp6,(len(numnp5),1)))**2,3)
    catres3=~(np.equal((np.repeat(catnp5,len(catnp6),axis=0)),(np.tile(catnp6,(len(catnp5),1)))))
    sumtogeth3=result3.sum(axis=1)
    sumcattoget3=catres3.sum(axis=1)
    sum_result3=sumtogeth3+sumcattoget3
    final_result3=np.around(np.sqrt(sum_result3),3)
    final_result20=np.reshape(final_result3, (len(x.index),len(y.index)))
    return final_result20

metric=distmetric(df,df)
print(metric)

I get the following distance matrix:

 [[0.    1.005 0.2     nan 1.005 1.02  1.005   nan]
 [1.005 0.    1.044   nan 0.2   1.044 0.2     nan]
 [0.2   1.044 0.      nan 1.005 1.    1.005   nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [1.02  1.044 1.      nan 1.005 1.    1.005   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]]
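
For readers wondering where the nan entries (and the 1. at position x6/x6) come from, here is a minimal sketch of the two NaN behaviours involved. The variable names are made up for illustration and are not part of the question's code.

```python
import numpy as np

# Any arithmetic involving NaN yields NaN, so a missing numeric value
# turns the whole squared-difference sum for that pair into NaN.
squared_diff = (np.nan - 0.6) ** 2
pair_total = squared_diff + 0.0

# NaN is never equal to itself, so two missing categories are counted
# as a mismatch (contribution 1) instead of a match.
missing = np.nan
categories_differ = missing != missing

print(np.isnan(pair_total))   # True
print(categories_differ)      # True
```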

I would like to get the following output instead:

    x1     x2     x3     x4     x5     x6     x7     x8
x1  0.0    1.005  0.2    1.0    1.005  1.02   1.005  1.414
x2  1.005  0.0    1.044  1.414  0.2    1.044  0.2    1.414
x3  0.2    1.044  0.0    1.0    1.005  1.0    1.005  1.414
x4  1.0    1.414  1.0    0.0    1.414  1.414  1.414  1.0
x5  1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x6  1.02   1.044  1.0    1.414  1.005  0.0    1.005  1.0
x7  1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x8  1.414  1.414  1.414  1.0    1.414  1.0    1.414  0.0

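I want the distance between two NaN values to be 0, and the distance between NaN and any number or any string to be 1. Is there a way to do this?

Edit: I am computing the distance in the following form: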

for each row:
    if col is numerical:
        then calculate ((x1 element)-(x2 element))**2 and store this value in squareresult
    if col is categorical:
        then compare x1 element and x2 element.
        if they are equal then cateresult=0
        else cateresult=1
    totaldistanceresultforrow=sqrt(squareresult+cateresult)

Note: NaN-NaN=0 and NaN-any number or string=1 (here '-' denotes subtraction)
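
The rule described above can be written out for a single pair of rows roughly like this. This is only a sketch under a few assumptions: `is_missing` and `pair_distance` are hypothetical helper names, rows are passed as plain tuples, and a value counts as categorical when it is a Python `str`.

```python
import math
import numpy as np
import pandas as pd


def is_missing(value):
    """Return True for NaN/None, False for any other scalar."""
    return bool(pd.isna(value))


def pair_distance(row_x, row_y):
    """Distance between two rows under the NaN rules from the question."""
    total = 0.0
    for vx, vy in zip(row_x, row_y):
        if is_missing(vx) and is_missing(vy):
            contribution = 0.0                       # NaN vs NaN
        elif is_missing(vx) or is_missing(vy):
            contribution = 1.0                       # NaN vs number or string
        elif isinstance(vx, str) or isinstance(vy, str):
            contribution = 0.0 if vx == vy else 1.0  # categorical mismatch
        else:
            contribution = (vx - vy) ** 2            # numeric squared difference
        total += contribution
    return math.sqrt(total)


x1 = (0.6, "cat")
x4 = (np.nan, "cat")
x8 = (np.nan, np.nan)
print(round(pair_distance(x1, x4), 3))  # 1.0
print(round(pair_distance(x1, x8), 3))  # 1.414
print(round(pair_distance(x4, x8), 3))  # 1.0
```

The three printed values agree with the x1/x4, x1/x8 and x4/x8 entries of the desired output table.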

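【Comments】:

Why not convert the NaN values in the DataFrame to the integer 0? I think that would solve the problem... See stackoverflow.com/questions/13295735/…
@Sarthak Negi: With my algorithm, I can't convert NaN to any integer. If I do, the distance metric causes problems in my project.
@Sarthak Negi: I could do that for categorical data; it wouldn't affect my results. But I can't do it for numeric data.

Tags: python python-3.x pandas numpy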

【Solution 1】:

This worked for me:

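# Pairwise squared differences of the numeric column (NaN wherever either value is NaN).
# Index the NumPy array, not the Series: Series[:, None] no longer works in recent pandas.
a_vals = df['a'].values
square_res = (a_vals - a_vals[:, None]) ** 2
numeric = pd.DataFrame(square_res)
# Columns that are entirely NaN belong to rows whose 'a' value is NaN
idx = numeric.isnull().all()
alltrueindices = np.where(idx)

for index in alltrueindices:
    # NaN vs NaN counts as 0
    numeric.loc[index, index] = 0
# NaN vs any number counts as 1
numeric = numeric.fillna(1)
# Use a placeholder so that two missing categories compare as equal
df['b'] = df['b'].replace(np.nan, '?')
b_vals = df['b'].values
cat_res = (b_vals != b_vals[:, None])
res = (numeric + cat_res) ** .5

print(res.round(3))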
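
If there are more than two columns, the same idea can be wrapped in a function that loops over the columns and keeps the x1..x8 labels on the result. The following is a sketch rather than a tested library routine: `nan_aware_distance` is a hypothetical name, and it assumes every non-numeric column should be treated as categorical.

```python
import numpy as np
import pandas as pd


def nan_aware_distance(frame):
    """Pairwise distance matrix; NaN vs NaN adds 0, NaN vs a value adds 1."""
    n = len(frame)
    total = np.zeros((n, n))
    for col in frame.columns:
        series = frame[col]
        missing = series.isna().to_numpy()
        both_missing = missing[:, None] & missing[None, :]
        one_missing = missing[:, None] ^ missing[None, :]
        if pd.api.types.is_numeric_dtype(series):
            vals = series.to_numpy(dtype=float)
            contrib = (vals[:, None] - vals[None, :]) ** 2   # squared difference
        else:
            vals = series.to_numpy(dtype=object)
            contrib = (vals[:, None] != vals[None, :]).astype(float)  # 0/1 mismatch
        # Override the entries that involve missing values
        contrib = np.where(one_missing, 1.0, contrib)
        contrib = np.where(both_missing, 0.0, contrib)
        total += contrib
    return pd.DataFrame(np.sqrt(total), index=frame.index, columns=frame.index)


df = pd.DataFrame(
    {"a": [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     "b": ["cat", "bat", "cat", "cat", "bat", np.nan, "bat", np.nan]},
    index=["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"],
)
dist = nan_aware_distance(df)
print(dist.round(3))
```

Unlike the snippet above, this version does not modify `df` and does not rely on a '?' placeholder, so a real category named '?' would not be confused with a missing value.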

【Comments】: