Custom sentence segmentation using Spacy: answers

【Question title】: Custom sentence segmentation using Spacy
【Posted】: 2019-02-11 19:12:10
【Question description】:

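I am new to spaCy and NLP, and I have run into the following problem with sentence segmentation in spaCy.

The text I am trying to split into sentences contains a numbered list, with a space between each number and the text that follows it, as shown below.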

import spacy
nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)

The output (1., 2., 3. are treated as separate lines) is:

This is first sentence.
  
Next is numbered list.
    
1.
Hello World!
 
2.
Hello World2!
  
3.
Hello World!

But if there is no space between the number and the text, sentence segmentation works fine, as below:

import spacy
nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1.Hello World!\n2.Hello World2!\n3.Hello World!"
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)

The output (as desired) is:

This is first sentence.
    
Next is numbered list.
   
1.Hello World!
    
2.Hello World2!
    
3.Hello World!

Please suggest whether the sentence detector can be customized to do this.

【Question discussion】:

See also stackoverflow.com/questions/61785922/…

Tags: nlp tokenize spacy sentence

【Solution 1】:

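When you use a pretrained model with spaCy, sentences are split by the model's dependency parser, so the boundaries reflect the data the model was trained on.

Of course, there are cases like yours where you may want custom sentence segmentation logic. This is possible by adding a component to the spaCy pipeline.

For your case, you can add a rule that prevents a sentence split after a "{number}." pattern.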
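To make the rule concrete before wiring it into spaCy, here is a minimal sketch that works on a plain list of token strings. The helper name `blocked_sentence_starts` and the hand-written token list are assumptions made for illustration, not part of spaCy; the list assumes "1." is tokenized as "1" followed by ".".

```python
import re

# A token made up only of digits, e.g. the "1" in "1."
NUMBER = re.compile(r"^[0-9]+$")


def blocked_sentence_starts(tokens):
    """Return the indices of tokens that must not start a new sentence."""
    blocked = []
    # Skip the first token (no predecessor) and the last (no successor).
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "." and NUMBER.match(tokens[i - 1]):
            # The token right after "<number>." continues the same sentence.
            blocked.append(i + 1)
    return blocked


# Hand-written tokens for "Next is numbered list.\n1. Hello World!"
tokens = ["Next", "is", "numbered", "list", ".", "\n", "1", ".", "Hello", "World", "!"]
print(blocked_sentence_starts(tokens))  # [8], i.e. "Hello"
```

The period after "list" is left alone because the token before it is not a number, so ordinary sentence ends are unaffected.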

A workaround for your problem:

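import re
import spacy

# Written for spaCy 2.x; loads the same model as in the question.
nlp = spacy.load('en_core_web_sm')
# Matches a token made up only of digits, e.g. the "1" in "1."
boundary = re.compile('^[0-9]+$')

def custom_seg(doc):
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        # A "." right after a number is a list marker, not a sentence end,
        # so the token that follows it must not start a new sentence.
        if token.text == '.' and boundary.match(prev) and index != (length - 1):
            doc[index + 1].is_sent_start = False
        prev = token.text
    return doc

# Run the rule before the parser so the parser respects these boundaries.
nlp.add_pipe(custom_seg, before='parser')
text = 'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)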
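
The snippet above passes the function itself to `nlp.add_pipe`, which is the spaCy 2.x convention. In spaCy 3.x a component has to be registered under a name first and is then added by that name. The following is a minimal sketch of the same rule for spaCy 3.x, not the answerer's code: the component name `numbered_list_boundaries` is made up for illustration, and it assumes a blank English pipeline with the rule-based `sentencizer`, so that no trained model has to be downloaded.

```python
import re

import spacy
from spacy.language import Language

# A token made up only of digits, e.g. the "1" in "1."
LIST_NUMBER = re.compile(r"^[0-9]+$")


# "numbered_list_boundaries" is a hypothetical name chosen for this sketch.
@Language.component("numbered_list_boundaries")
def numbered_list_boundaries(doc):
    # Skip the first token (no predecessor) and the last (no successor).
    for i in range(1, len(doc) - 1):
        if doc[i].text == "." and LIST_NUMBER.match(doc[i - 1].text):
            # The token after "<number>." must not open a new sentence.
            doc[i + 1].is_sent_start = False
    return doc


# Assumption: a blank pipeline plus the rule-based sentencizer stands in
# for a trained model here.
nlp = spacy.blank("en")
nlp.add_pipe("numbered_list_boundaries")
nlp.add_pipe("sentencizer")  # by default keeps boundaries that are already set

text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
sentences = [sent.text.strip() for sent in nlp(text).sents]
for sentence in sentences:
    print(sentence)
```

With a trained pipeline such as `en_core_web_sm`, the same component would presumably be added with `nlp.add_pipe("numbered_list_boundaries", before="parser")` instead, so that the parser keeps the boundaries that were already ruled out.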

Hope this helps!

【Discussion】: