[Posted]: 2019-08-13 07:42:58
[Question]:
Suppose I have a text document like the following:
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
(or a more complex text example:
document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour. This position will be working until Easter with a <em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information. </p>'
)
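I am applying a series of NLP preprocessing techniques to obtain a "cleaner" version of this document, stemming each word along the way.
I am using the following code for this:
import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Three candidate stemmers; only stemmer_3 (Snowball) is used below
stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')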
# Remove all the special characters
document = re.sub(r'\W', ' ', document)
# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)
# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document)
# Converting to lowercase
document = document.lower()
# Tokenisation
document = document.split()
# Stemming
document = [stemmer_3.stem(word) for word in document]
# Join the words back to a single document
document = ' '.join(document)
This gives the following output for the text document above:
'am sent am anoth sent am third sent'
(and this output for the more complex example:
'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'
)
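What I want to do now is get output in exactly the same form as above, but after applying lemmatization instead of stemming.
However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging, and then lemmatizing.
Things get a bit complicated here, though, because the text data comes from web scraping, so you encounter many HTML tags such as <br>, <p>, etc.
My idea is that whenever a sequence of words ends with a common punctuation mark (full stop, exclamation mark, etc.) or an HTML tag (such as <br>, <p>, etc.), it should be treated as a separate sentence.
For example, the original document above: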
document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
should be split like this:
['I am a sentence', 'I am another sentence', 'I am a third sentence']
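A minimal sketch of the splitting rule described above, using only the standard `re` module. The helper name `split_sentences` and the list of block-level tags treated as sentence boundaries are assumptions made for illustration; inline tags such as <em> and <strong> are simply dropped rather than treated as boundaries.

```python
import re

# Assumed set of block-level tags that end a sentence; extend as needed.
BLOCK_TAG = r'</?(?:p|br|div|li|h[1-6])\b[^>]*>'
# A boundary is either a block-level tag or a run of ., ! or ?
BOUNDARY = re.compile(BLOCK_TAG + r'|[.!?]+', flags=re.I)
# Any tag that is still present after splitting (e.g. <em>, <strong>)
INLINE_TAG = re.compile(r'<[^>]+>')


def split_sentences(text):
    """Split text at block-level HTML tags and sentence-final punctuation."""
    sentences = []
    for part in BOUNDARY.split(text):
        part = INLINE_TAG.sub(' ', part)   # drop remaining inline tags
        part = ' '.join(part.split())      # collapse whitespace
        if part:                           # skip empty fragments
            sentences.append(part)
    return sentences


document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
print(split_sentences(document))
# ['I am a sentence', 'I am another sentence', 'I am a third sentence']
```

Note that a plain regex like this also splits at full stops inside abbreviations and decimal numbers, and it leaves HTML entities such as &nbsp; untouched, so it is only a starting point for real scraped text.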
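Then I suppose we would apply POS tagging to each sentence, split each sentence into words, apply lemmatization, and .join() the words back into a single document, as I do in the code above.
How can I do this?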
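One possible shape for the tag-then-lemmatize step, as a sketch rather than a definitive answer. It assumes NLTK's `pos_tag` and `WordNetLemmatizer`, which need the WordNet and perceptron-tagger data downloaded via `nltk.download(...)` (the exact resource names depend on the NLTK version). The helper names `penn_to_wordnet` and `lemmatize_sentences` are hypothetical, and the input is assumed to be a list of already-split sentence strings.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun, also used for every other tag


def lemmatize_sentences(sentences):
    """POS-tag each sentence, lemmatize every word, and join into one string."""
    # Imported lazily so the tag mapping above works without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    words = []
    for sentence in sentences:
        # Tag the original-case tokens; lowercase only when lemmatizing.
        for word, tag in pos_tag(sentence.split()):
            words.append(lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag)))
    return ' '.join(words)


# Example call (requires the NLTK data mentioned above):
# lemmatize_sentences(['I am a sentence', 'I am another sentence'])
```

The mapping matters because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a POS is passed, so verbs such as "am" or "looking" would otherwise be left unchanged. Removing single characters and non-word characters, as in the stemming code, would still have to be applied separately to match the earlier output format.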
[Discussion]:
Tags: python nlp text-parsing stemming lemmatization