Algorithm for grouping unstructured text: answers

【Question title】: Algorithm for grouping unstructured text
【Posted】: 2020-11-23 00:29:15
【Question description】:

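I have a lot of unstructured book data, for example:

The Lord of the Rings J.R.R Tolkien

Lord of the Rings Tolkien Good condition

Very good condition Lord of the Rings jrr Tolkien

harry potter and the sorcerer's stone hardcover

JK rowling harry potter and the sorcerer's stone

The Stone of the Sorcerer by Bob Smith

I am trying to work out which lines represent the same book. For example, the first 3 lines should be grouped together (Lord of the Rings), the next 2 lines should be grouped together (Harry Potter), and the last line is a group of its own (The Stone of the Sorcerer by Bob Smith). What is a good algorithm for doing this?

(I added "The Stone of the Sorcerer by Bob Smith" after first posting the question, to make it clear that simply matching two words is not enough.)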

【Question discussion】:

Tags: algorithm nlp cluster-analysis

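【Solution 1】:

I would filter out the neutral words (very, the, good, condition, ...) and match titles by the number of words they have in common. If you can recognize initials, remove the dots. For efficient comparison, sort the words alphabetically. Also lowercase everything.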

jrr lord rings tolkien

lord rings tolkien 

jrr lord rings tolkien

harry potter sorcerer stone

harry jk potter rowling sorcerer stone

At least two words should match.
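The normalization described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `NEUTRAL_WORDS` stop list is guessed from the examples in this answer, and the helper names `normalize` and `shared_words` are made up here rather than taken from the answer.

```python
import re

# Assumed stop list: the answer's examples plus a few obvious additions.
NEUTRAL_WORDS = {"the", "of", "and", "by", "very", "good", "condition", "hardcover"}

def normalize(title):
    """Lowercase, drop dots and possessive 's, remove neutral words, sort."""
    text = title.lower().replace(".", "").replace("'s", "")
    words = re.findall(r"[a-z0-9]+", text)
    return sorted(set(words) - NEUTRAL_WORDS)

def shared_words(a, b):
    """Number of normalized words the two titles have in common."""
    return len(set(normalize(a)) & set(normalize(b)))

titles = [
    "The Lord of the Rings J.R.R Tolkien",
    "Lord of the Rings Tolkien Good condition",
    "Very good condition Lord of the Rings jrr Tolkien",
    "harry potter and the sorcerer's stone hardcover",
    "JK rowling harry potter and the sorcerer's stone",
]

for title in titles:
    print(" ".join(normalize(title)))
```

Two titles would then be treated as the same book when `shared_words` returns 2 or more, which is the rule stated in this answer.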

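【Discussion】:

I tried a similar approach, but it runs into trouble when book titles are very similar. For example, "JK rowling harry potter and the sorcerer's stone" and "The Stone of the Sorcerer by Bob Smith" are not the same book, but they share 2 words, so an algorithm like this would classify them as the same.
@phil: It gets even worse if two authors with the same name use the same title.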
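The false match described in this comment is easy to reproduce. The sketch below also prints the Jaccard overlap (shared words divided by all distinct words in both titles). Jaccard is not proposed anywhere in this thread; it is only one assumed way of making the count relative to title length. The stop list and the helper names are hypothetical as well.

```python
import re

# Assumed stop list, the same kind of "neutral words" the answer filters out.
NEUTRAL_WORDS = {"the", "of", "and", "by", "very", "good", "condition", "hardcover"}

def words(title):
    """Lowercased word set without dots, possessive 's, or neutral words."""
    text = title.lower().replace(".", "").replace("'s", "")
    return set(re.findall(r"[a-z0-9]+", text)) - NEUTRAL_WORDS

def jaccard(a, b):
    """Shared words divided by all distinct words in the two titles."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

rowling = "JK rowling harry potter and the sorcerer's stone"
hardcover = "harry potter and the sorcerer's stone hardcover"
smith = "The Stone of the Sorcerer by Bob Smith"

print(len(words(rowling) & words(smith)))  # shared words behind the false match
print(jaccard(rowling, smith))             # different books
print(jaccard(rowling, hardcover))         # same book
```

For these three titles the different books score 0.25 and the same book scores about 0.67, so a cutoff between the two would separate them. Whether one cutoff holds up on real listings is untested here, and it does nothing for the same-author, same-title case raised by the second comment.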

【Solution 2】:

Maybe something like this.

from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import numpy as np

sentences = [["The Lord of the Rings J.R.R Tolkien"],
            ["Lord of the Rings Good condition"],
            ["Very good condition Lord of the Rings jrr Tolkien"],
            ["harry potter and the sorcerer's stone hardcover"],
            ["JK rowling harry potter and the sorcerer's stone"]]

# Word2Vec expects each sentence as a list of tokens, so split every title
# into lowercase words instead of passing the whole title as one "word".
tokenized = [s[0].lower().split() for s in sentences]

# gensim >= 4.0: the parameter is vector_size (formerly size) and the
# word vectors are read through m.wv.
m = Word2Vec(tokenized, vector_size=50, min_count=1, sg=1)

def vectorizer(tokens, m):
    """Represent a title as the mean of the vectors of its known words."""
    vecs = [m.wv[w] for w in tokens if w in m.wv]
    return np.mean(vecs, axis=0)

# One row per title.
X = np.array([vectorizer(tokens, m) for tokens in tokenized])

n_clusters = 2
clf = KMeans(n_clusters=n_clusters,
             max_iter=100,
             init='k-means++',
             n_init=1)
labels = clf.fit_predict(X)
print(labels)
for index, sentence in enumerate(sentences):
    print(str(labels[index]) + ":" + str(sentence))

Result (cluster labels can differ between runs, because k-means initialization is random):

0:['The Lord of the Rings J.R.R Tolkien']
0:['Lord of the Rings Good condition']
1:['Very good condition Lord of the Rings jrr Tolkien']
0:["harry potter and the sorcerer's stone hardcover"]
1:["JK rowling harry potter and the sorcerer's stone"]

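KMeans is almost certainly not the best way to cluster any kind of text data. You may also want to look at other clustering algorithms. In this case, agglomerative clustering may be more robust.

This is kind of interesting.

For example, if I change it to this...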

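for metric in ["cosine", "euclidean", "cityblock"]:
    # scikit-learn >= 1.2 names this parameter metric (formerly affinity).
    clf = AgglomerativeClustering(n_clusters=n_clusters,
                                  linkage="average", metric=metric)
    labels = clf.fit_predict(X)
    print(metric, labels)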

I get...

1:['The Lord of the Rings J.R.R Tolkien']
0:['Lord of the Rings Good condition']
0:['Very good condition Lord of the Rings jrr Tolkien']
0:["harry potter and the sorcerer's stone hardcover"]
0:["JK rowling harry potter and the sorcerer's stone"]

【Discussion】:

【Solution 3】:

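If you are not worried about performance (i.e. it may take a while), you can compare every string against every other string, which is O(n^2), and produce the following for each pair:

The longest run of consecutive matching characters between the 2 strings, ignoring punctuation and case (i.e. compare only [0-9A-Za-z] and skip every other character). (the key)
The length of that match. (the score)

The score then decides which "longest string matches" to keep and which to discard.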

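Given your book list:

harry potter and the sorcerer's stone hardcover
JK rowling harry potter and the sorcerer's stone
The Stone of the Sorcerer by Bob Smith

Books 1 and 2 share: "harry potter and the sorcerer's stone". Books 1 and 3 share: "sorcerer". Because the first match is longer than the second, book 1 keeps only the index key "harry potter and the sorcerer's stone".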

Then we group the data on this key. It will be fairly slow (very slow for large data sets), but it should give you decent accuracy.
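A minimal sketch of this idea, assuming Python's standard `difflib.SequenceMatcher` as the way to find the longest common run of characters (the answer does not name an implementation). The helper names `squash`, `longest_common_run`, and `group_by_best_key` are hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

def squash(title):
    """Lowercase and keep only [0-9a-z], skipping every other character."""
    return re.sub(r"[^0-9a-z]", "", title.lower())

def longest_common_run(a, b):
    """Longest run of consecutive characters that a and b share."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def group_by_best_key(titles):
    """Key each title by its longest match with any other title, then group."""
    squashed = [squash(t) for t in titles]
    groups = defaultdict(list)
    for i, title in enumerate(titles):
        # Compare against every other title: O(n^2) comparisons overall.
        candidates = [longest_common_run(squashed[i], squashed[j])
                      for j in range(len(titles)) if j != i]
        # The score is the match length; keep only the highest-scoring key.
        key = max(candidates, key=len, default=squashed[i])
        groups[key].append(title)
    return dict(groups)

books = [
    "harry potter and the sorcerer's stone hardcover",
    "JK rowling harry potter and the sorcerer's stone",
    "The Stone of the Sorcerer by Bob Smith",
]
for key, members in group_by_best_key(books).items():
    print(key, members)
```

On these three titles the first two end up under the key `harrypotterandthesorcerersstone`. The third gets the key `thesorcerer`, because spaces are skipped before matching, and forms its own group. A minimum score below which a title stays ungrouped would be a natural extension, but that threshold is not specified in the answer.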

【Discussion】: