Computing the Euclidean distance for KNN

【Question title】: Computing the Euclidean distance for KNN
【Posted】: 2018-02-14 01:10:27
【Question description】:

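I have seen many examples of computing the Euclidean distance for KNN, but none of them for sentiment classification.

For example, I have the sentence "A very close game".

How do I compute its Euclidean distance to the sentence "A great game"?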

【Question discussion】:

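It is unclear what the "Euclidean distance" between sentences means. To get any distance at all, you first need to fix an encoding: for example, you could use count vectors, their binary versions, or tf-idf vectors.
Suppose you have the training data from this link, and you have to classify the sentence "A very close game" using KNN... something like that.
That data consists of sentence strings. As I mentioned before, there are many ways to vectorize them.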
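
As a rough illustration of the count-vector encoding mentioned in these comments, the sketch below builds a shared vocabulary from the two sentences in the question and computes the Euclidean distance between their word-count vectors. The helper names (`tokenize`, `count_vector`, `euclidean`) and the lowercase, whitespace-based tokenization are assumptions made for illustration, not something given in the thread.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return sentence.lower().split()


def count_vector(tokens, vocabulary):
    # One coordinate per vocabulary word: how often it occurs in the sentence.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def euclidean(u, v):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


tokens_a = tokenize("A very close game")
tokens_b = tokenize("A great game")

# The coordinate system: every distinct word seen in either sentence.
vocabulary = sorted(set(tokens_a) | set(tokens_b))

vec_a = count_vector(tokens_a, vocabulary)
vec_b = count_vector(tokens_b, vocabulary)

distance = euclidean(vec_a, vec_b)
print(vocabulary)
print(vec_a, vec_b)
print(distance)
```

The two sentences differ in exactly three words (close, great, very), so the distance here is the square root of 3. With tf-idf weights instead of raw counts, only `count_vector` would need to change.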

Tags: machine-learning sentiment-analysis knn nearest-neighbor euclidean-distance

【Solution 1】:

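Think of a sentence as a point in a multidimensional space. A Euclidean distance can be computed only after you have defined a coordinate system. For example, you could introduce:

O1 - the sentence length (Length)
O2 - the word count (WordsCount)
O3 - the alphabetic center (I just made this up). It can be computed as the arithmetic mean of the alphabetic centers of the words in the sentence.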

CharsIndex = Sum(Char.indexInWord) / CharsCountInWord
CharsCode = Sum(Char.charCode) / CharsCount
AlphaWordCoordinate = [CharsIndex, CharsCode]
WordsIndex = Sum(Words.CharsIndex) / WordsCount
WordsCode = Sum(Words.CharsCode) / WordsCount
AlphaSentenceCoordinate = (WordsIndex^2 + WordsCode^2 + WordIndexInSentence^2)^(1/2)

So the Euclidean distance can be computed as follows:

EuclidianSentenceDistance = (WordsCount^2 + Length^2 + AlphaSentenceCoordinate^2)^(1/2)

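Now every sentence can be converted into a point in three-dimensional space, such as P[Length, WordsCount, AlphaSentenceCoordinate]. Once you have a distance, you can compare and classify sentences.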
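
Note that the formula above yields the length of a single sentence's vector, that is, its distance from the origin. To compare two sentences, one would instead take the distance between their two points. A minimal sketch, assuming a simplified point of (length, word count, mean character code); `sentence_point` and `point_distance` are hypothetical names, and the third coordinate is an illustrative stand-in rather than the AlphaSentenceCoordinate defined above:

```python
import math


def sentence_point(sentence):
    # Hypothetical 3-D coordinates: character count, word count,
    # and the mean character code of the non-space characters.
    words = sentence.split()
    chars = [c for c in sentence if not c.isspace()]
    mean_code = sum(ord(c) for c in chars) / len(chars) if chars else 0.0
    return (len(sentence), len(words), mean_code)


def point_distance(p, q):
    # Euclidean distance between two points with the same number of coordinates.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


p1 = sentence_point("A great game")
p2 = sentence_point("A very close game")

print(p1, p2)
print(point_distance(p1, p2))
```

Because the coordinates live on very different scales, the largest one tends to dominate the result; rescaling the features before running KNN is a common remedy.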

I guess this is not an ideal approach, but I wanted to show you the idea.

import math

def calc_word_alpha_center(word):
    # Mean character position and mean character code of a single word.
    chars_index = 0
    chars_codes = 0
    for index, char in enumerate(word):
        chars_index += index
        chars_codes += ord(char)
    chars_count = len(word)
    index = chars_index / chars_count
    code = chars_codes / chars_count
    return (index, code)


def calc_alpha_distance(words):
    # Average the per-word centers and the word positions, then return the
    # length of the resulting three-component vector.
    word_chars_index = 0
    word_code = 0
    word_index = 0
    for index, word in enumerate(words):
        point = calc_word_alpha_center(word)
        word_chars_index += point[0]
        word_code += point[1]
        word_index += index
    chars_index = word_chars_index / len(words)
    code = word_code / len(words)
    index = word_index / len(words)
    return math.sqrt(math.pow(chars_index, 2) + math.pow(code, 2) + math.pow(index, 2))

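def calc_sentence_euclidean_distance(sentence):
    # Note: this is the length of the vector (length, words_count, alpha_distance),
    # i.e. the sentence's distance from the origin, not a distance between two sentences.
    length = len(sentence)

    # split() without arguments skips empty strings caused by repeated spaces,
    # which would otherwise trigger a division by zero for an empty word.
    words = sentence.split()
    words_count = len(words)

    alpha_distance = calc_alpha_distance(words)

    return math.sqrt(math.pow(length, 2) + math.pow(words_count, 2) + math.pow(alpha_distance, 2))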


sentence1 = "a great game"
sentence2 = "A great game"

distance1 = calc_sentence_euclidean_distance(sentence1)
distance2 = calc_sentence_euclidean_distance(sentence2)

print(sentence1)
print(str(distance1))

print(sentence2)
print(str(distance2))

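Console output (as posted; these values match Python 2, where / between integers truncates, so Python 3 prints slightly different numbers):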

a great game
101.764433866
A great game
91.8477000256
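
For the KNN step itself, which this answer does not show, one possible reading combines the binary word vectors mentioned in the question comments with a majority vote over the nearest training sentences. For 0/1 vectors, the squared Euclidean distance equals the number of words that occur in exactly one of the two sentences. In the sketch below, the training sentences and their labels are made up purely for illustration, and `word_set`, `binary_euclidean`, and `knn_predict` are hypothetical helpers:

```python
import math
from collections import Counter


def word_set(sentence):
    # Binary bag of words: only the presence or absence of each word matters.
    return set(sentence.lower().split())


def binary_euclidean(a, b):
    # For 0/1 vectors, each word found in exactly one set contributes 1 to the sum.
    return math.sqrt(len(a ^ b))


def knn_predict(train, sentence, k=3):
    # train: list of (sentence, label) pairs; returns the majority label
    # among the k training sentences closest to the query.
    query = word_set(sentence)
    ranked = sorted(train, key=lambda pair: binary_euclidean(word_set(pair[0]), query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up toy training data, not taken from the question.
train = [
    ("A great game", "positive"),
    ("A very good game", "positive"),
    ("A boring match", "negative"),
    ("A very dull match", "negative"),
]

print(binary_euclidean(word_set("A great game"), word_set("A very close game")))
print(knn_predict(train, "A very close game", k=3))
```

Ties between equally distant neighbors are resolved here by the order of the training list, which is an arbitrary choice; a real classifier would need an explicit tie-breaking rule.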

【Discussion】:

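I'm confused... Could you try the calculation with my example? For instance, as in this link: *.com/questions/17053459/…
I have added a code sample. You can use it and try to improve the quality of the function, because as it stands, as you can see, the function is very sensitive to small changes such as letter case.
I have read the code, but I think it is different from what I am trying to do... Suppose: training sentence: "A great game"; unlabeled sentence: "A very close game". I want to compute the Euclidean distance between the two sentences. From what I have read, I should convert each sentence to a binary vector, as in the link in my earlier comment...
You could try applying the Levenshtein distance; it is very close to what you need.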
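
The Levenshtein (edit) distance suggested here counts the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another. It is not a Euclidean distance, but it can serve as the distance measure in KNN. A minimal sketch of the standard dynamic-programming formulation; applying it to word lists as well as to characters is an assumption about what fits this use case:

```python
def levenshtein(a, b):
    # Classic dynamic programming over prefixes, keeping one row at a time.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


s1 = "A great game"
s2 = "A very close game"

# Character-level edit distance.
print(levenshtein(s1, s2))
# Word-level edit distance: the same function works on lists of tokens.
print(levenshtein(s1.split(), s2.split()))
```

At the word level, the two sentences are two edits apart: replace "great" with "very" and insert "close".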