How to group text data based on document similarity? Answers

【Question title】: How to group text data based on document similarity?
【Posted】: 2017-11-07 14:08:52
【Question】:

Consider the following dataframe:

import pandas as pd

df = pd.DataFrame({'Questions': ['What are you doing?','What are you doing tonight?','What are you doing now?','What is your name?','What is your nick name?','What is your full name?','Shall we meet?',
                             'How are you doing?' ]})

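                     Questions
0          What are you doing?
1  What are you doing tonight?
2      What are you doing now?
3           What is your name?
4      What is your nick name?
5      What is your full name?
6               Shall we meet?
7           How are you doing?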

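How can I group the rows of this dataframe so that similar questions end up together, i.e. how do I get groups like the ones below?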

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')

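6    Shall we meet?
Name: Questions, dtype: object 

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object 

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object 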

A similar question was asked here, but it was not very clear, so it never received an answer.

【Question comments】:

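Actually, for a problem like this, NLP / cosine similarity really is the best way forward.
Yes, I'm working on that right now. I'll definitely post an update once I get it working. Still a beginner. :) Your solution is great too :)
NLP, or maybe fuzzywuzzy
@Wen fuzzywuzzy sounds promising, but I haven't used it yet. Could you add a solution based on it?
code.activestate.com/recipes/52213 Maybe you can do more research based on your own data, because NLP designs vary widely and it all depends on the data you are working with.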
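
The cosine-similarity idea raised in the comments above can be sketched with the standard library alone. This is a minimal illustration, not code from the thread: the helper names `bag_of_words`, `cosine_similarity` and `group_by_cosine`, the plain word-count vectors, the greedy "join the first similar group" rule and the 0.7 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its word tokens (hypothetical helper)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def group_by_cosine(texts, threshold=0.7):
    """Greedy grouping: each text joins the first group whose first member
    is at least `threshold` similar, otherwise it starts a new group."""
    vectors = [bag_of_words(t) for t in texts]
    representatives = []  # index of the first member of each group
    labels = []
    for i, vec in enumerate(vectors):
        for group, rep in enumerate(representatives):
            if cosine_similarity(vec, vectors[rep]) >= threshold:
                labels.append(group)
                break
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return labels

questions = ['What are you doing?', 'What are you doing tonight?',
             'What are you doing now?', 'What is your name?',
             'What is your nick name?', 'What is your full name?',
             'Shall we meet?', 'How are you doing?']

labels = group_by_cosine(questions)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 0]
```

The resulting list can be assigned to a column (for example `df['similarity'] = labels`) and passed to `groupby`. The threshold is data dependent, and the greedy rule makes the result sensitive to row order, so treat both as starting points.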

Tags: python pandas group-by nltk similarity

【Solution 1】:

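This is a rather lengthy approach: compute the normalized similarity score between every pair of elements in the series, then group the rows by the resulting list of similarities converted to a string, i.e.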

# Requires NLTK's tokenizer, POS tagger and WordNet data (fetch them once with nltk.download).
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

def convert_tag(tag):   
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'),
              Synset('friend.n.01')]
    """

    synsetlist =[]
    tokens=nltk.word_tokenize(doc)
    pos=nltk.pos_tag(tokens)    
    for tup in pos:
        try:
            synsetlist.append(wn.synsets(tup[0], convert_tag(tup[1]))[0])
        except IndexError:
            # No synset exists for this word/tag combination, so skip it.
            continue
    return synsetlist

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sums all of the largest similarity values and normalizes the sum by dividing it by the number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """

    highscores = []
    for synset1 in s1:
        highest_yet = 0
        for synset2 in s2:
            # path_similarity returns None when no path connects the two synsets.
            simscore = synset1.path_similarity(synset2)
            if simscore is not None and simscore > highest_yet:
                highest_yet = simscore

        # Only synsets that matched something contribute to the average.
        if highest_yet > 0:
            highscores.append(highest_yet)

    return sum(highscores) / len(highscores) if highscores else 0

def document_path_similarity(doc1, doc2):
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


def similarity(x,df):
    sim_score = []
    for i in df['Questions']:
        sim_score.append(document_path_similarity(x,i))
    return sim_score

With the functions defined above, we can now do

df['similarity'] = df['Questions'].apply(lambda x : similarity(x,df)).astype(str)

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')

Output:

This is not the best way to solve the problem, and it is very slow. Any new approach is highly appreciated.
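
Part of the slowness comes from how `similarity` is applied: it calls `document_path_similarity` for every ordered pair of rows, and each call runs `doc_to_synsets` again on both questions. Because `document_path_similarity` averages the two directions, it is symmetric, so one possible speed-up is to preprocess each question once and score each unordered pair once. The sketch below illustrates this under stated assumptions: `similarity_matrix` is a hypothetical helper, and `to_word_set` and `jaccard` are stand-ins so the example runs without NLTK.

```python
from itertools import combinations_with_replacement

def similarity_matrix(docs, preprocess, score):
    """Build a full n x n matrix from a symmetric pairwise score.

    preprocess runs once per document; score runs once per unordered
    pair and the result is mirrored across the diagonal.
    """
    prepared = [preprocess(d) for d in docs]
    n = len(prepared)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations_with_replacement(range(n), 2):
        matrix[i][j] = matrix[j][i] = score(prepared[i], prepared[j])
    return matrix

# Stand-ins (assumptions for this example only, not part of Solution 1).
def to_word_set(doc):
    return set(doc.lower().rstrip('?').split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

docs = ['What are you doing?', 'How are you doing?', 'Shall we meet?']
matrix = similarity_matrix(docs, to_word_set, jaccard)
for row in matrix:
    print(row)

# With the functions from Solution 1 the call would presumably look like this
# (not executed here, since it needs the NLTK data):
# similarity_matrix(df['Questions'], doc_to_synsets,
#                   lambda a, b: (similarity_score(a, b) + similarity_score(b, a)) / 2)
```

For n questions this makes n preprocessing calls and n(n+1)/2 scoring calls instead of roughly 2n² and n². The work is still quadratic, so it only reduces the constant factor.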

【Comments】:

【Solution 2】:

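You should first sort all the names in the list/dataframe column and then run the similarity code on only n-1 rows, i.e. compare each row with the next element. If the two are similar, mark the pair as 1, otherwise 0, and then walk through the list to form the groups. This avoids comparing every row with all the other elements, which takes on the order of n^2 comparisons.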
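
The sort-then-compare-neighbours idea described above can be sketched as follows. The answer does not say which similarity measure to use, so `difflib.SequenceMatcher` from the standard library, the function name `group_sorted_neighbours` and the 0.8 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def group_sorted_neighbours(texts, threshold=0.8):
    """Label texts by comparing each sorted item only with its predecessor."""
    # Sort the indices by text so lexicographically close strings become adjacent.
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    labels = [None] * len(texts)
    group = 0
    for pos, idx in enumerate(order):
        if pos > 0:
            previous = texts[order[pos - 1]]
            ratio = SequenceMatcher(None, previous, texts[idx]).ratio()
            if ratio < threshold:
                group += 1  # not similar to the previous item: start a new group
        labels[idx] = group
    return labels

questions = ['What are you doing?', 'What are you doing tonight?',
             'What are you doing now?', 'What is your name?',
             'What is your nick name?', 'What is your full name?',
             'Shall we meet?', 'How are you doing?']

labels = group_sorted_neighbours(questions)
for label, question in zip(labels, questions):
    print(label, question)
```

This needs only n-1 comparisons after the sort, but sorting is lexicographic, so questions that differ at the start (such as 'How are you doing?' and 'What are you doing?') do not end up next to each other and cannot share a group.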

【Comments】: