How to extract a substring from a given text using NLP?

【Question title】: How to extract a substring from a given text using NLP?
【Posted】: 2019-09-17 10:51:11
【Question description】:

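I am trying to strip out the part of an answer's text that adds no value to the answer. I tried n-grams, but the results were not satisfactory.

I am computing the similarity between two texts with Google's Universal Sentence Encoder, and I have observed that I get better results if I clean the text before passing it to the encoder. I want to remove the text that is repeated from the question, since it adds no value to the answer.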

def extract_answer(question, answer):
    # << some code goes here >>
    return extracted_text

Question = "Why is the plasma membrane called a selectively permeable membrane?"

Answer = "The cell membrane or the plasma membrane is known as a selectively permeable membrane because it regulates the movement of substances in and out of the cell. This means that the plasma membrane allows the entry of only some substances and prevents the movement of some other materials."

extracted_answer = extract_answer(Question,Answer)

print(extracted_answer) 



Sample 1
---------

Input
-------
Question: Why is the plasma membrane called a selectively permeable membrane?
Answer: The cell membrane or the plasma membrane is known as a selectively permeable membrane because it regulates the movement of substances in and out of the cell. This means that the plasma membrane allows the entry of only some substances and prevents the movement of some other materials.

Expected Output
---------------

Output: it regulates the movement of substances in and out of the cell. This means that the plasma membrane allows the entry of only some substances and prevents the movement of some other materials.


Sample 2
----------  

Input
-------
Question: Why is the diver able to cross the river?
Answer: The swimmer is able to cross the river because the particles of matter have space between them. 

Expected Output
---------------

Output: particles of matter have space between them.

【Question comments】:

Tags: regex python-3.x machine-learning nlp

【Solution 1】:

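General case (question answering with no pattern):

Your problem falls, in a way, within the field of question answering. Much current research works on techniques for extracting answers from documents, which may be what you are looking for.

Recent research in this area can be found at the ACL (Association for Computational Linguistics): https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art)

That covers general text, where there is no pattern.

Direct questions and answers (following a pattern):

However, if your data follows a pattern such as "What is x? -> X is blablabla", you can use a set of heuristics like the ones I suggest below (I used spaCy, but the logic is similar with Google NLP):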

import spacy

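# The 'en' shortcut was removed in spaCy 3; load the model by its package name.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# You can use machine learning to find question types better than a manual description!
question_type = {"why": "reason", "what": "factoid", "how": "quantity"}
explanation_words = {"reason": ["because"], "factoid": ["is", "are"], "quantity": ["is", "are"]}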

questions = ['Why is the plasma membrane called a selectively permeable membrane?', 'What is a car?', 'How many wheels are there in a car?']
answers = ['The cell membrane or the plasma membrane is known as a selectively permeable membrane because it regulates the movement of substances in and out of the cell. This means that the plasma membrane allows the entry of only some substances and prevents the movement of some other materials.',
          'A car is a vehicle used for human or object transportation.',
          'There are usually 4 wheels in a common car.']

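for question, answer in zip(questions, answers):
    # Find the question word. Matching on the lowercased text (rather than the
    # POS tag alone) avoids a KeyError when another adverb or pronoun comes
    # first, and does not depend on how the model tags wh-words.
    q_type = ""
    for token in nlp(question):
        if token.lower_ in question_type:
            q_type = question_type[token.lower_]
            break
    answer_doc = nlp(answer)
    # Character offset of the first explanation word; -1 keeps the whole answer.
    answer_start_pos = -1
    for token in answer_doc:
        if token.text in explanation_words.get(q_type, []):
            answer_start_pos = token.idx
            break
    print("\nQuestion: ", question, "\nAnswer:", " ".join(token.text for token in answer_doc if token.idx > answer_start_pos))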

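Result:

Question:  Why is the plasma membrane called a selectively permeable membrane?
Answer: it regulates the movement of substances in and out of the cell . This means that the plasma membrane allows the entry of only some substances and prevents the movement of some other materials .

Question:  What is a car?
Answer: a vehicle used for human or object transportation .

Question:  How many wheels are there in a car?
Answer: usually 4 wheels in a common car .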

It is up to you to cover all the question types and polish the results.
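For reference, the same cue-word idea can be wrapped into the `extract_answer(question, answer)` signature from the question. This is only a minimal sketch under some assumptions: it uses plain regular expressions from the standard library instead of spaCy tokenization, `CUE_WORDS` is a name made up here for illustration, and the cue lists are the same hand-written guesses as above, so it will miss anything that does not follow the pattern.

```python
import re

# Hypothetical table: question word -> cue words that introduce the answer.
# It mirrors the hand-written dictionaries above and is not exhaustive.
CUE_WORDS = {
    "why": ["because"],
    "what": ["is", "are"],
    "how": ["is", "are"],
}


def extract_answer(question, answer):
    """Return the text after the first cue word, or the answer unchanged."""
    # First known question word appearing in the question, if any.
    q_word = next(
        (w for w in re.findall(r"[a-z]+", question.lower()) if w in CUE_WORDS),
        None,
    )
    if q_word is None:
        return answer
    # Whole-word, case-sensitive match on any cue word for this question type.
    pattern = r"\b(?:" + "|".join(map(re.escape, CUE_WORDS[q_word])) + r")\b"
    match = re.search(pattern, answer)
    if match is None:
        return answer
    return answer[match.end():].strip()


print(extract_answer(
    "Why is the diver able to cross the river?",
    "The swimmer is able to cross the river because the particles of matter have space between them. ",
))
# prints: the particles of matter have space between them.
```

Note that for Sample 2 this keeps the leading "the", which the expected output in the question drops, so some post-processing is still needed. When no question word or cue word is found, the answer is returned unchanged rather than guessed at.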

【Comments】: