[Title]: Lemmatisation of web scraped data
[Posted]: 2019-08-13 07:42:58
[Question]:

Suppose I have a text document like the following:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

(or, for a more complex example:

document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour.&nbsp; This position will be working&nbsp;until Easter with a&nbsp;<em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge  but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS  successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information.&nbsp;</p>' 

)

I am applying a series of NLP preprocessing techniques to get a "cleaner" version of this document, stemming each word along the way.

I am using the following code for this:

import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')

# Remove all the special characters
document = re.sub(r'\W', ' ', document)

# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document, flags=re.I)

# Converting to lowercase
document = document.lower()

# Tokenisation
document = document.split()

# Stemming
document = [stemmer_3.stem(word) for word in document]

# Join the words back to a single document
document = ' '.join(document)

This gives the following output for the text document above:

'am sent am anoth sent am third sent'

(and this output for the more complex example:

'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'

)

What I want to do now is get exactly the same kind of output as above, but after applying lemmatisation instead of stemming.

However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging, and then performing the lemmatisation.

But things are a bit complicated here, because the text data comes from web scraping, so you run into many HTML tags such as &lt;br&gt;, &lt;p&gt;, etc.

My idea is that whenever a sequence of words ends with a common punctuation mark (full stop, exclamation mark, etc.) or with an HTML tag such as &lt;br&gt;, &lt;p&gt;, etc., it should be treated as a separate sentence.

For example, the original document above:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

should be split like this:

['I am a sentence', 'I am another sentence', 'I am a third sentence']

Then I suppose we would apply POS tagging to each sentence, split each sentence into words, apply lemmatisation, and .join() the words back into a single document, as I did in my code above.

How can I do this?
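The splitting rule described above (treat sentence-final punctuation and HTML tags as sentence boundaries) can be sketched with a single regular expression. This is only a minimal illustration, not a robust HTML parser, and `split_sentences` is a hypothetical helper name:

```python
import re

def split_sentences(document):
    """Split text into sentences, treating HTML tags and
    runs of sentence-final punctuation as boundaries."""
    # '<[^>]+>' matches any HTML tag; '[.!?]+' matches punctuation runs.
    parts = re.split(r'<[^>]+>|[.!?]+', document)
    # Drop empty fragments and trim surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
split_sentences(document)
# → ['I am a sentence', 'I am another sentence', 'I am a third sentence']
```

Note that this also strips the trailing punctuation, which matches the desired output shown above.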

[Discussion]:

    Tags: python nlp text-parsing stemming lemmatization


    [Solution 1]:

    Removing HTML tags is a common part of text cleaning. You can use your own hand-written rules such as text.replace('&lt;p&gt;', '.'), but there is a better solution: html2text. This library can do all the dirty HTML-cleaning work for you, for example:

    >>> import html2text
    >>> h = html2text.HTML2Text()
    >>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
    Hello, world!
    

    You can import this library in your Python code, or use it as a standalone program.

    Edit: here is a small chained example that splits the text into sentences:

    >>> import re
    >>> import nltk
    >>> import html2text
    >>> document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
    >>> text_without_html = html2text.html2text(document)
    >>> refined_text = re.sub(r'\n+', '. ', text_without_html)
    >>> sentences = nltk.sent_tokenize(refined_text)
    >>> sentences
    
    ['I am a sentence.', 'I am another sentence.', 'I am a third sentence..']
    

    [Discussion]:

    • Thanks. To be honest, I was looking for a complete solution that also splits the text document into sentences; not just how to remove HTML tags from a single sentence.
    • After removing the HTML tags, you can use NLTK's sent_tokenize to split your cleaned text into sentences: from nltk.tokenize import sent_tokenize. Of course there is no magic function that does everything; the cleaning process is made up of many small sequential functions :)
    • Haha OK, I already have the overall idea; I just want the implementation.
    • OK, this looks better (upvoted). However, keep in mind that I don't only have &lt;p&gt; tags but every other kind of HTML tag as well, so I am not sure whether this refined_text = re.sub(r'\n+', '. ', text_without_html) will work for all of them.

    • In the end I also want this: ['I am a sentence', 'I am another sentence', 'I am a third sentence']. So sentences without any punctuation.
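    For the lemmatisation step itself, note that NLTK's WordNetLemmatizer treats every word as a noun unless a WordNet POS is supplied, so the Penn Treebank tags produced by nltk.pos_tag have to be mapped first. Below is a minimal sketch of that mapping; penn_to_wordnet is a hypothetical helper name, and the NLTK usage in the comments assumes the tagger and WordNet data have been downloaded:

    ```python
    def penn_to_wordnet(tag):
        """Map a Penn Treebank tag (as returned by nltk.pos_tag)
        to the single-letter POS codes WordNetLemmatizer expects."""
        if tag.startswith('J'):
            return 'a'   # adjective
        if tag.startswith('V'):
            return 'v'   # verb
        if tag.startswith('R'):
            return 'r'   # adverb
        return 'n'       # noun (the lemmatizer's default)

    # Hypothetical usage with NLTK (requires nltk plus the
    # 'punkt', 'averaged_perceptron_tagger' and 'wordnet' data):
    # from nltk import pos_tag, word_tokenize
    # from nltk.stem import WordNetLemmatizer
    # lemmatizer = WordNetLemmatizer()
    # words = word_tokenize(sentence)
    # lemmas = [lemmatizer.lemmatize(w, penn_to_wordnet(t))
    #           for w, t in pos_tag(words)]
    ```

    Running this per sentence and then joining the lemmas with ' '.join() mirrors the stemming pipeline in the question.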