【Question Title】: Tensorflow KNN: How can we assign the K parameter for defining the number of neighbors in KNN?
【Posted】: 2017-08-20 21:42:08
【Question Description】:

I have started a machine learning project using the K-Nearest-Neighbors method with the Python tensorflow library. I have no experience with the tensorflow tools, so I found some code on GitHub and modified it for my data.

My dataset looks like this:

2,2,2,2,0,0,3
2,2,2,2,0,1,0
2,2,2,4,2,2,1
...
2,2,2,4,2,0,0

Here is the code, which actually runs well:

import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'car-eval-data-1.csv'
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)

train_x, train_y = reader[:samples,:5], reader[:samples,6]
test_x, test_y = reader[samples:, :5], reader[samples:, 6]

# A placeholder is assigned values at run time; it works like a typed input slot.
#   v = tf.placeholder("float", [None, 4])  -- values can be multidimensional
training_values = tf.placeholder("float",[None,len(train_x[0])])
test_values     = tf.placeholder("float",[len(train_x[0])])

# Distance metric (note: despite the original "MANHATTAN" label, this is the
# squared EUCLIDEAN distance; tf.abs is redundant since squares are non-negative)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values,test_values)),reduction_indices=1))

prediction = tf.arg_min(distance, 0)
init = tf.global_variables_initializer()

accuracy = 0.0

with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # Run the graph to find the index of the training sample closest to this test sample.
        index_in_trainingset = sess.run(prediction, feed_dict={training_values:train_x,test_values:test_x[i]})    

        print("Test %d, and the prediction is %s, the real value is %s"%(i,train_y[index_in_trainingset],test_y[i]))
        if train_y[index_in_trainingset] == test_y[i]:
            # If the prediction is correct, increase the accuracy.
            accuracy += 1. / len(test_x)

print('Accuracy -> ', accuracy * 100, ' %')

The only thing I don't understand is that if this is the KNN method, there has to be some K parameter that defines the number of neighbors used to predict the label of each test sample.
How can we assign the K parameter to tune the number of nearest neighbors for this code?
Is there any way to modify this code so that it makes use of a K parameter?
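
For context, what the code above performs is a 1-nearest-neighbor search: for each test sample it takes the single training sample with the smallest distance and copies its label. A minimal NumPy sketch of that idea, using hypothetical toy data rather than the car-eval dataset:

```python
import numpy as np

# Toy training data: 4 samples with 2 features each, plus class labels.
train_x = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
train_y = np.array([0, 0, 1, 1])
test_sample = np.array([5.0, 6.0])

# Squared Euclidean distance from the test sample to every training sample.
distances = np.sum((train_x - test_sample) ** 2, axis=1)

# 1-NN: index of the single closest training sample (what tf.arg_min does above).
nearest = np.argmin(distances)
prediction = train_y[nearest]
print(prediction)  # prints 1
```

With K fixed at 1 there is no voting step, which is exactly why the original code has no K parameter to tune.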

【Question Discussion】:

Tags: machine-learning tensorflow python-3.6 knn


【Solution 1】:

You are right that the example above makes no provision for selecting the K nearest neighbors. In the code below, I have added the ability to set such a parameter (knn_size), along with some other corrections.

import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'PATH_TO_DATASET_CSV'
knn_size = 1
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)

train_x, train_y = reader[:samples,:6], reader[:samples,6]
test_x, test_y = reader[samples:, :6], reader[samples:, 6]

# A placeholder is assigned values at run time; it works like a typed input slot.
#   v = tf.placeholder("float", [None, 4])  -- values can be multidimensional
training_values = tf.placeholder("float",[None, len(train_x[0])])
test_values     = tf.placeholder("float",[len(train_x[0])])


# Distance metric (note: despite the original "MANHATTAN" label, this is the
# squared EUCLIDEAN distance; tf.abs is redundant since squares are non-negative)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values,test_values)),reduction_indices=1))

# Here, we multiply the distance by -1 to reverse the magnitude of distances, i.e. the largest distance becomes the smallest distance
# tf.nn.top_k returns the top k values and their indices, here k is controlled by the parameter knn_size 
k_nearest_neighbour_values, k_nearest_neighbour_indices = tf.nn.top_k(tf.scalar_mul(-1,distance),k=knn_size)

#Based on the indices we obtain from the previous step, we locate the exact class label set of the k closest matches in the training data
best_training_labels = tf.gather(train_y,k_nearest_neighbour_indices)

if knn_size == 1:
    prediction = tf.squeeze(best_training_labels)
else:
    # Now we make our prediction based on the class label that appears most frequently
    # tf.unique_with_counts() gives us all unique values that appear in a 1-D tensor along with their indices and counts 
    values, indices, counts = tf.unique_with_counts(best_training_labels)
    # This gives us the index of the class label that has repeated the most
    max_count_index = tf.argmax(counts,0)
    #Retrieve the required class label
    prediction = tf.gather(values,max_count_index)

init = tf.global_variables_initializer()

accuracy = 0.0

with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # Run the graph to get the majority-vote prediction for this test sample.
        prediction_value = sess.run([prediction], feed_dict={training_values:train_x,test_values:test_x[i]})

        print("Test %d, and the prediction is %s, the real value is %s"%(i,prediction_value[0],test_y[i]))
        if prediction_value[0] == test_y[i]:
            # If the prediction is correct, increase the accuracy.
            accuracy += 1. / len(test_x)

print('Accuracy -> ', accuracy * 100, ' %')
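
To make the knn_size logic concrete outside of a TF1 session, here is a small NumPy sketch of the same steps with hypothetical distances: negate the distances so that the k largest negated values are the k closest samples (what `tf.nn.top_k` does here), gather their labels (`tf.gather`), and return the most frequent label (`tf.unique_with_counts` plus `tf.argmax`):

```python
import numpy as np

knn_size = 3

# Hypothetical distances from one test sample to 6 training samples,
# with the corresponding training labels.
distance = np.array([4.0, 1.0, 3.0, 2.0, 9.0, 7.0])
train_y = np.array([2, 0, 1, 0, 1, 2])

# tf.nn.top_k on -distance picks the k LARGEST negated values, i.e. the
# k SMALLEST distances. Sorting the negated distances gives the same indices.
k_nearest_indices = np.argsort(-(-distance))[:knn_size]   # indices [1, 3, 2]

# Equivalent of tf.gather: labels of the k closest matches.
best_training_labels = train_y[k_nearest_indices]         # labels [0, 0, 1]

# Equivalent of tf.unique_with_counts + tf.argmax: majority vote.
values, counts = np.unique(best_training_labels, return_counts=True)
prediction = values[np.argmax(counts)]
print(prediction)  # prints 0: two of the three nearest neighbors have label 0
```

Note that `np.unique`, like `tf.unique_with_counts` followed by `tf.argmax`, breaks ties deterministically by picking the first value with the maximal count, so for even K a tie between classes is resolved arbitrarily; odd K avoids two-class ties.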

【Discussion】:

  • It works, but there is an error: print("Test %d, and the prediction is %s, the real value is %s"%(i,train_y[index_in_trainingset],test_y[i])) raises IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices. Do you know why this happens?
  • The sess.run() call returns -1. I guess that is the problem. Do you know why this happens?
  • @MasoudMasoumiMoghadam - Yes, the values will be negative because we multiply by -1 in order to apply tf.nn.top_k(). So I suppose I should take the absolute value for the prediction. I have made the change accordingly.
  • @MasoudMasoumiMoghadam - Although the edit above will give you a positive value from sess.run(prediction), I believe that is not the source of your problem. In your version of the code, sess.run(prediction) returns the index of the distance with the minimum value, but with the code I provided you actually get the predicted value, not an index. So train_y[index_in_trainingset] looks like the source of your problem. Instead, just print "index_in_trainingset", and your if condition should be " if index_in_trainingset == test_y[i]: ". I think it would be best to rename "index_in_trainingset".
  • I made the changes and the code is running, but I don't know why it keeps predicting the same number even when I change the value of K. Before I used the K parameter my accuracy reached 70%, and now it is below 20%. Do you have any suggestions?