KNN: giving more weight to specific features in the distance

【Question title】: Knn give more weight to specific feature in distance
【Posted】: 2020-10-01 00:56:13
【Question】:

I am working with the Kobe Bryant Dataset. I want to predict shot_made_flag with a KnnRegressor.

I have already extracted year and month features from game_date:

import re

# convert season to years (raw strings keep the regex escapes intact)
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile(r'(\d+)-').findall(x)[0]))

# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile(r'-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])

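I would like to give the season, year, and month features more weight in the distance function, so that events closer in date to the current event become nearer neighbours, while still keeping a reasonable distance to other potential data points. For example, I don't want an event from the same day to be the nearest neighbour just because of the date features; the other features, such as shot_range, should still be taken into account.
To give them more weight, I tried using the metric parameter with a custom distance function, but the function's arguments are just numpy arrays without the pandas column information, so I'm not sure what I can do or how to implement what I'm trying to do.

Edit:

Using larger weights for the date features to find the best k, running 10-fold CV for k in [1, 100]: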

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features

not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)

neighbors = list(range(1, 101))  # k = 1..100 inclusive
cv_scores = []

weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
 X.columns.get_loc("year"),
 X.columns.get_loc("month")
]] = 5
weight = weight/weight.sum()  #Normalize weights

def my_distance(x, y):
    dist = ((x-y)**2)
    return np.dot(dist, weight)

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))

# optimal K: higher ROC AUC is better, so take the maximum score
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()

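It runs really slowly; any idea how to make it faster? The idea behind weighting the features is to find neighbours closer in date to the data point, to avoid data leakage, and to use CV to find the optimal k.

【Comments】:

Tags: pandas machine-learning scikit-learn knn weighted-average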

【Solution 1】:

First, you have to prepare a 1D numpy weight array that specifies a weight for each feature. You could do something like this:

weight = np.ones((M,))  # M is the number of features
weight[[1,7,10]] = 2    # Increase weight of the features at (0-based) indices 1, 7 and 10
weight = weight/weight.sum()  # Normalize weights

You can use kobe_data_encoded.columns to look up the indices of the season, year, and month features in the dataframe and use them in place of the second line above.
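The column lookup described above can be sketched as follows. The small DataFrame `X` and its values are made up for illustration (they are not the Kobe data), and `date_cols` / `date_idx` are hypothetical names; only the idea of mapping column names to array positions comes from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature frame; only the column order matters here.
X = pd.DataFrame({
    "loc_x": [0.1, 0.4],
    "season": [2000, 2001],
    "shot_distance": [0.3, 0.7],
    "year": [2000, 2001],
    "month": [10, 11],
})

date_cols = ["season", "year", "month"]
# Positions of the date columns in the NumPy array the distance function will see
date_idx = X.columns.get_indexer(date_cols)

weight = np.ones(X.shape[1])
weight[date_idx] = 2            # up-weight the date features
weight = weight / weight.sum()  # normalize so the weights sum to 1
```

Looking the positions up by name keeps the weights correct if columns are later added or reordered, as long as the array passed to KNN has the same column order as `X`.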

Now define a distance function which, per the guidelines, must take two 1D numpy arrays.

def my_dist(x,y):
    global weight     #1D array, same shape as x or y
    dist = ((x-y)**2) #1D array, same shape as x or y
    return np.dot(dist,weight)  # a scalar float

And initialize the KNeighborsRegressor as:

knn = KNeighborsRegressor(metric=my_dist)
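Because `my_dist` is a weighted squared Euclidean distance, multiplying each feature column by `sqrt(weight)` and using the built-in Euclidean metric should rank neighbours the same way, without a Python callback per pair. The sketch below checks this on random data; the data, the weights, and the use of `NearestNeighbors` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((60, 4))                     # illustrative data, not the Kobe set
weight = np.array([1.0, 1.0, 5.0, 5.0])     # illustrative weights
weight = weight / weight.sum()

def my_dist(x, y):
    # weighted squared Euclidean distance, as in the answer
    return np.dot((x - y) ** 2, weight)

# Slow path: Python callable evaluated pair by pair (brute force)
nn_custom = NearestNeighbors(n_neighbors=5, algorithm="brute", metric=my_dist).fit(X)
d_custom, idx_custom = nn_custom.kneighbors(X)

# Fast path: scale the features once, then use the default Euclidean metric
X_scaled = X * np.sqrt(weight)
nn_scaled = NearestNeighbors(n_neighbors=5).fit(X_scaled)
d_scaled, idx_scaled = nn_scaled.kneighbors(X_scaled)

# Same neighbours; the Euclidean distance is the square root of my_dist
print(np.array_equal(idx_custom, idx_scaled))
```

The same scaled array can be passed to KNeighborsRegressor or KNeighborsClassifier with the default metric, which also avoids building an N x N matrix.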

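Edit: To make it more efficient, you can precompute the distance matrix and reuse it in KNN. This should give a significant speedup by reducing the number of calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now: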

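X_arr = np.asarray(X)             # plain array: X[i] on a DataFrame looks up a column label, not row i
dist = np.zeros((len(X_arr), len(X_arr)))  # computing the N x N distance matrix
for i in range(len(X_arr)):       # you can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(X_arr)):
        dist[i,j] = my_dist(X_arr[i], X_arr[j])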

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed') #Note: metric='precomputed' 
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc'))) #Note: passing dist instead of X
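With `metric='precomputed'`, `fit` expects a square train-by-train matrix and `predict` expects a test-by-train matrix. The following sketch does that slicing by hand for a single split and compares it with an ordinary KNN on scaled features. The data, labels, weights, and split are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 3))                 # illustrative features
y = (X[:, 0] > 0.5).astype(int)         # illustrative binary labels
weight = np.array([1.0, 1.0, 5.0])
weight = weight / weight.sum()

# Full N x N weighted squared-distance matrix via broadcasting (no Python loops)
diff = X[:, None, :] - X[None, :, :]
dist = (diff ** 2 * weight).sum(axis=-1)

train = np.arange(30)
test = np.arange(30, 40)

knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
knn.fit(dist[np.ix_(train, train)], y[train])   # rows and columns: training samples
pred = knn.predict(dist[np.ix_(test, train)])   # rows: test samples, columns: training samples

# Reference: Euclidean KNN on sqrt(weight)-scaled features ranks neighbours the same way
X_scaled = X * np.sqrt(weight)
reference = KNeighborsClassifier(n_neighbors=3, algorithm="brute").fit(X_scaled[train], y[train])
ref_pred = reference.predict(X_scaled[test])
```

The broadcasting step allocates an N x N x n_features temporary, so it only suits small N; it is used here to keep the sketch short.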

I couldn't test this, so let me know if something is wrong.

【Comments】:

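Hey, I've edited my question using the dist function you provided. I'm trying to find the best k with CV and to avoid data leakage by using weighted features. It still takes a very long time to run even a few values of k out of [1, 100]. Could you take a look at the code I provided and maybe offer some insight into how to make it better?
Please let me know the benchmark results for the above; I can't test it right now.
Hey, len(X) is 25697, so allocating (len(X), len(X)) makes my Jupyter fail with a MemoryError. Any idea what I can do about it? I couldn't find how to do it.
I also tried dist = np.full((len(X),len(X)), -1, dtype=np.int8), but it didn't work.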
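For context on the MemoryError, a rough size estimate of the dense matrix is sketched below; `dense_matrix_bytes` is a hypothetical helper written for this illustration. An int8 matrix is small, but it cannot store fractional distances, so it is not a substitute for a float matrix.

```python
def dense_matrix_bytes(n_samples: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_samples x n_samples matrix."""
    return n_samples * n_samples * itemsize

n = 25697  # len(X) reported in the comment above
float64_gb = dense_matrix_bytes(n, 8) / 1e9   # NumPy's default float dtype
float32_gb = dense_matrix_bytes(n, 4) / 1e9   # half precision of the default
print(f"float64: {float64_gb:.2f} GB, float32: {float32_gb:.2f} GB")
# prints "float64: 5.28 GB, float32: 2.64 GB"
```

If even the float32 matrix does not fit, avoiding the N x N matrix altogether (for example by rescaling the features and using a built-in metric) may be the more practical route.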

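【Solution 2】:

Just adding to Shihab's answer regarding the distance computation. You can use scipy pdist, as suggested in this post, which is faster and more efficient.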

from scipy.spatial.distance import pdist, squareform

# create the custom weight array
weight = ...
# calculate pairwise distances, using the Minkowski norm (p=2) with custom weights;
# pdist takes the metric name as a string and p / w as keyword arguments
distances = pdist(X, 'minkowski', p=2, w=weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
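As a sanity check, the sketch below compares the `pdist` result with the weighted squared distance from Solution 1 on a few random points. It assumes a recent SciPy release, where the weighted Minkowski distance with p=2 is defined as sqrt(sum_i w_i * (u_i - v_i)**2); the data and weights are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((6, 3))                  # illustrative data
weight = np.array([1.0, 1.0, 5.0])      # illustrative weights
weight = weight / weight.sum()

# Weighted Minkowski distance (p=2), expanded to a square matrix
D = squareform(pdist(X, "minkowski", p=2, w=weight))

# Reference: weighted squared distance, computed with broadcasting
diff = X[:, None, :] - X[None, :, :]
D_sq = (diff ** 2 * weight).sum(axis=-1)

# D is the square root of the Solution 1 distance, so neighbour order is unchanged
print(np.allclose(D ** 2, D_sq))
```

Note that `squareform` still produces the full N x N matrix, so the memory concern raised in the comments on Solution 1 applies here as well.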

【Comments】: