ner, spacy, sentence segmentation: answers

【Question title】: ner, spacy, sentence segmentation
【Posted】: 2020-06-30 01:05:48
【Question description】:

I want to split the following text into sentences so that I can process it with spaCy:

Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n

I would like the result to look like this:

[
Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. ,

  Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n ]

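That is, two sentences, where the first one should end after "lat. 2° 30' S ^37." But because "lat." contains a period, the sentence is broken right after "lat.".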

So far I have not found a solution. I have tried these two approaches:

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ("lat."):
            # print("Detected:", token.text)
            doc[token.i].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
nlp.pipeline

and

a.split('.')

I think there is a small mistake somewhere in the first snippet.

Neither of the two approaches splits the sentences the way I need!
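One likely culprit in the first snippet can be shown without spaCy: ("lat.") is a plain string, not a one-element tuple, so the "in" test becomes a substring check and also matches tokens such as "." or "a". The variable names below are illustrative only. Separately, is_sent_start = False is set on the abbreviation token itself, whereas suppressing a break after the abbreviation would presumably require flagging the token that follows it. It is also worth checking whether the tokenizer emits "lat." as a single token at all.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
abbrev_string = ("lat.")    # this is just the str "lat."
abbrev_tuple = ("lat.",)    # this is a one-element tuple

# With a string, `in` is a substring test, so unrelated tokens match too.
dot_matches_string = "." in abbrev_string    # "." is a substring of "lat."
dot_matches_tuple = "." in abbrev_tuple      # "." is not an element of the tuple
lat_matches_tuple = "lat." in abbrev_tuple   # exact element match

print(type(abbrev_string).__name__, type(abbrev_tuple).__name__)
print(dot_matches_string, dot_matches_tuple, lat_matches_tuple)
```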

In general, what would you suggest for splitting a paragraph into sentences, especially when it contains abbreviations such as "lat."?

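【Question discussion】:

"That is, two sentences, where the first one should end after lat. 2° 30' S ^37." Can you share the text in a better/clearer format? "Neither of them works!" What does this mean? "In general, what would you suggest for splitting a paragraph into sentences?" Use a library designed for working with natural language, which you are already doing.
I have edited the text. Basically, the problem is that abbreviations such as "lat." cause unwanted sentence breaks. How would you split the paragraph into correct sentences?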

Tags: python nlp spacy

【Solution 1】:

I used the following, and it works well:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def sentence_segmentation(para):
    """Split para into sentences without breaking after known abbreviations."""
    punkt_param = PunktParameters()
    # Abbreviations in lowercase, without the trailing period:
    # lat -> latitude, ch -> chapter
    abbreviations = ['lat', 'ch']
    punkt_param.abbrev_types = set(abbreviations)
    # Passing PunktParameters instead of training text uses them as-is.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    return tokenizer.tokenize(para)
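
For comparison, here is a dependency-free sketch of the same idea (a period after a listed abbreviation does not end a sentence), using only the standard library. The names ABBREVIATIONS and split_sentences are hypothetical, and the heuristic is deliberately rough: it only looks at the word directly before the punctuation mark and is not a replacement for Punkt or spaCy.

```python
import re

# Hypothetical abbreviation list: lowercase, without the trailing period.
ABBREVIATIONS = {"lat", "ch"}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '!' or '?' followed by whitespace, except after a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Word immediately preceding the punctuation mark.
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if match.group()[0] == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, keep going
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Shortened version of the text from the question.
sample = ("the apparent position of Mars was 4° 11' 10\" Taurus, lat. 2° 30' S ^37. "
          "Thus we twice have Mars in the most opportune position.")
parts = split_sentences(sample)
for part in parts:
    print(repr(part))
```

On this shortened sample the split happens after "^37." and not after "lat.". Abbreviations preceded by punctuation, such as "(lat.", would need extra handling.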

【Discussion】: