【Question title】: How do I order vectors from sentence embeddings and give them out with their respective input?
【Posted】: 2021-05-22 13:00:48
【Question description】:

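I managed to generate vectors for every sentence in my two corpora and to compute the cosine similarity (dot product) between every possible pair: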

import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
embeddings1 = embed(embeddings1)

embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)

print(cosine_similarity(embeddings1, embeddings2))

array([[ 0.7882168 ,  0.3366559 ,  0.22973989,  0.15428472, -0.10180502,
                                                         -0.04344492],
       [ 0.256085  ,  0.7713026 ,  0.32120776,  0.17834462, -0.10769081,
                                                         -0.09398925],
       [ 0.23850328,  0.446203  ,  0.62606746,  0.25242645, -0.03946173,
                                                         -0.00908459],
       [ 0.24337521,  0.35571027,  0.32963073,  0.6373588 ,  0.08571904,
                                                         -0.01240187],
       [-0.07001016, -0.12002315, -0.02002328,  0.09045915,  0.9141338 ,
                                                          0.8373743 ],
       [-0.04525191, -0.09421931, -0.00631144, -0.00199519,  0.75919366,
                                                          0.9686416 ]]
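
The "(dot product)" remark above can be made concrete: cosine similarity is the dot product of the two vectors after each has been scaled to unit length. The following is a minimal NumPy sketch on made-up 2-D vectors, not on real sentence embeddings; the function name `cosine_similarity_matrix` and the toy arrays are assumptions for illustration only.

```python
import numpy as np


def cosine_similarity_matrix(x, y):
    """Return sim where sim[i, j] is the cosine similarity of x[i] and y[j]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale every row to unit length, then take all pairwise dot products.
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_unit @ y_unit.T


# Toy vectors standing in for embeddings (hypothetical data).
toy_a = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_b = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

sim = cosine_similarity_matrix(toy_a, toy_b)
print(sim.shape)  # (2, 3): one row per vector in toy_a, one column per vector in toy_b
```

Row i of the result belongs to the i-th input of the first corpus and column j to the j-th input of the second, which is what makes it possible to map a score back to its pair of sentences.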

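To get meaningful output, I need to sort these scores and return them together with their corresponding input sentences. Does anyone know how to do this? I have not found any tutorial for this task.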

【Question discussion】:

    Tags: python numpy nlp embedding sentence-similarity


    【Solution 1】:

    You can sort with np.argsort(...):

    import numpy as np
    import tensorflow_hub as hub
    from sklearn.metrics.pairwise import cosine_similarity

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    # Keep the raw sentences in their own lists so they can be paired with the scores later.
    seq1 = ["I'd like an apple juice",
            "An apple a day keeps the doctor away",
            "Eat apple every day",
            "We buy apples every week",
            "We use machine learning for text classification",
            "Text classification is subfield of machine learning"]
    embeddings1 = embed(seq1)

    seq2 = ["I'd like an orange juice",
            "An orange a day keeps the doctor away",
            "Eat orange every day",
            "We buy orange every week",
            "We use machine learning for document classification",
            "Text classification is some subfield of machine learning"]
    embeddings2 = embed(seq2)

    # a[i, j] is the cosine similarity between seq1[i] and seq2[j].
    a = cosine_similarity(embeddings1, embeddings2)


    def get_pairs(a, b):
        """Return an array of shape (len(a), len(b), 2) with pairs[i, j] == (a[i], b[j])."""
        a = np.array(a)
        b = np.array(b)

        c = np.array(np.meshgrid(a, b))
        c = c.T.reshape(len(a), -1, 2)

        return c


    pairs = get_pairs(seq1, seq2)

    # For each sentence in seq1, order the seq2 indices from most to least similar.
    sorted_idx = np.argsort(-a, axis=1)[..., None]

    # Reorder the pairs along axis 1 with those indices.
    sorted_pairs = np.take_along_axis(pairs, sorted_idx, axis=1)

    # The three best matches for the first sentence of seq1.
    print(sorted_pairs[0, 0])
    print(sorted_pairs[0, 1])
    print(sorted_pairs[0, 2])


    ["I'd like an apple juice" "I'd like an orange juice"]
    ["I'd like an apple juice" 'An orange a day keeps the doctor away']
    ["I'd like an apple juice" 'Eat orange every day']
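
If a single ranked list over all pairs is wanted rather than a per-row ordering, one possible approach is to sort the flattened similarity matrix and map each flat position back to its (row, column) pair. This is only a sketch: `rank_pairs`, the toy sentences, and the toy scores are assumptions made up for illustration, and the scores are not output of the encoder.

```python
import numpy as np


def rank_pairs(sentences_a, sentences_b, similarity):
    """Return (score, sentence_a, sentence_b) tuples, highest score first."""
    similarity = np.asarray(similarity)
    if similarity.shape != (len(sentences_a), len(sentences_b)):
        raise ValueError("similarity must have shape (len(sentences_a), len(sentences_b))")
    # Sort all entries of the flattened matrix, then reverse for descending order.
    flat_order = np.argsort(similarity, axis=None)[::-1]
    # Convert flat positions back into row and column indices.
    rows, cols = np.unravel_index(flat_order, similarity.shape)
    return [(float(similarity[r, c]), sentences_a[r], sentences_b[c])
            for r, c in zip(rows, cols)]


# Hypothetical inputs: two short corpora and a hand-written score matrix.
toy_a = ["apple juice", "machine learning"]
toy_b = ["orange juice", "document classification"]
toy_sim = np.array([[0.79, -0.10],
                    [-0.07, 0.91]])

ranked = rank_pairs(toy_a, toy_b, toy_sim)
for score, left, right in ranked:
    print(f"{score:+.2f}  {left!r} <-> {right!r}")
```

Because each tuple carries the score next to both input sentences, the result can be printed or filtered without any further index bookkeeping.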
    

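    【Discussion】:

    • OK, how would you write this more generally, so that I can compare an input of 500 lines against an input of 5,000 lines while also returning the vectors?
    • It doesn't work. I put "I'd like an apple juice" as the last entry of the second corpus, and your code only returns the first three entries without sorting them.
    • My idea was to run the embedding after taking out one sentence at a time with a for loop, but it says the input is wrong.
    • I'm pretty sure your problem does not come from this solution. Could you mention the error you are getting?
    • Sure, your code has no error, but it just returns the corpus without doing the sorting. If I do the embedding in a for loop, I get the following message: tensorflow.python.framework.errors_impl.InvalidArgumentError: input must be a vector, got shape: [] [[{{node StatefulPartitionedCall/StatefulPartitionedCall/text_preprocessor/tokenize/StringSplit/StringSplit}}]] [Op:__inference_restored_function_body_5285] Function call stack: restored_function_body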
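
For the 500 versus 5,000 line case raised in the comments, fully sorting every row may be more work than needed when only the best few matches per sentence matter. A possible sketch, assuming a precomputed similarity matrix, uses np.argpartition to pick the top k columns per row and then sorts only those. The function name `top_k_matches` and the random matrix are hypothetical stand-ins, not part of the original answer.

```python
import numpy as np


def top_k_matches(similarity, k):
    """For each row, return column indices and scores of the k highest entries, best first."""
    similarity = np.asarray(similarity)
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, similarity.shape[1])
    # argpartition places the k largest entries of each row in the last k slots (unordered).
    candidates = np.argpartition(similarity, -k, axis=1)[:, -k:]
    candidate_scores = np.take_along_axis(similarity, candidates, axis=1)
    # Sort only those k candidates, highest score first.
    order = np.argsort(-candidate_scores, axis=1)
    top_idx = np.take_along_axis(candidates, order, axis=1)
    top_scores = np.take_along_axis(candidate_scores, order, axis=1)
    return top_idx, top_scores


# Random scores standing in for a 500 x 5000 cosine-similarity matrix (made-up data).
rng = np.random.default_rng(0)
toy_sim = rng.uniform(-1.0, 1.0, size=(500, 5000))

idx, scores = top_k_matches(toy_sim, k=3)
print(idx.shape, scores.shape)  # (500, 3) (500, 3)
```

Here `idx[i, j]` is a row number in the second corpus, so the same index can be used to look up both the matching sentence and its embedding vector.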
    【Solution 2】:

    I was passing a single string instead of a list of strings. Problem solved.
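
The "input must be a vector, got shape: []" message is consistent with the model receiving one bare string (a scalar) where it expects a list of strings. A small guard like the one below could avoid that inside a loop. It is a sketch under that assumption: `as_sentence_batch` is a hypothetical helper, and the call to `embed` appears only as a comment because loading the model is outside the scope of this example.

```python
def as_sentence_batch(sentences):
    """Wrap a single string in a list; turn any other iterable of strings into a list."""
    if isinstance(sentences, str):
        return [sentences]
    return list(sentences)


corpus = ["I'd like an apple juice", "Eat apple every day"]

for sentence in corpus:
    batch = as_sentence_batch(sentence)
    # With the model from the posts above, the call would then be: vector = embed(batch)
    print(batch)
```

Embedding the whole list in one call, as the first solution does, avoids the loop entirely and is usually the simpler option.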

    【Discussion】:
