【Question title】: NER, spaCy, sentence segmentation
【Posted】: 2020-06-30 01:05:48
【Question】:

I want to split the following text into sentences so that I can process it with spaCy:

Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n

I would like the result to look like this:

[
Finally, on 1595 July 22 at 2h 40m am, when the sun was at 7° 59' 52" Leo, 101,487 distant from earth, Mars's mean longitude 11s 14° 9' 5", and anomaly 164° 48' 55", and consequent eccentric position from the vicarious hypothesis 17° 16' 36" Pisces: the apparent position of Mars, from the most select observations, was 4° 11' 10" Taurus, lat. 2° 30' S ^37. ,

  Thus we twice have Mars in the most opportune position, in quadrature with the sun, while the positions of earth and Mars are also distant by a quadrant.\n ]

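That is, two sentences, with the first one ending after `lat. 2° 30' S ^37.` But because `lat.` ends with a period, the text is split right after `lat.` instead.

So far I have not found a solution. I have tried these two approaches: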

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ("lat."):
            # print("Detected:", token.text)
            doc[token.i].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
nlp.pipeline

a.split('.')

I think there is a small mistake somewhere in the first snippet.

Neither of the two approaches splits the sentences as needed!
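One candidate for that small mistake can be checked in plain Python, without spaCy: `("lat.")` is a parenthesized string, not a tuple, so `token.text in ("lat.")` performs a substring test. The sketch below only illustrates that point; `matches`, `not_a_tuple`, and `one_tuple` are hypothetical names introduced for the illustration. Separately, spaCy's own custom-boundary example sets the flag on the token that follows the marker (`doc[token.i + 1]`), which is probably what is wanted here as well, and the tokenizer may split `lat.` into `lat` and `.`, in which case an exact match on `lat.` would never fire.

```python
# Parentheses alone do not make a tuple: ("lat.") is just the string "lat.".
not_a_tuple = ("lat.")
one_tuple = ("lat.",)  # the trailing comma makes a one-element tuple


def matches(token_text, container):
    """Mirror the `token.text in (...)` test from the question."""
    return token_text in container


candidates = ["lat.", "lat", ".", "Leo"]

# Against a string, `in` is a substring test, so unrelated tokens match too.
substring_hits = [t for t in candidates if matches(t, not_a_tuple)]
# Against a tuple, only the exact token text matches.
exact_hits = [t for t in candidates if matches(t, one_tuple)]

print(substring_hits)  # ['lat.', 'lat', '.']
print(exact_hits)      # ['lat.']
```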

In general, what do you recommend for splitting a paragraph into sentences, especially when it contains an abbreviation such as

lat. 

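【Comments】:

  • "Meaning two sentences, the first sentence after lat. 2° 30' S ^37." Can you share the text in a better/clearer format? "Both are not working!" What does that mean? "In general, for segmenting a paragraph into sentences, what do you suggest?" Use a library designed for working with natural language, which you are already doing.
  • I have edited the text. Basically, the problem is that a word like "lat.", which is an abbreviation, causes an unwanted sentence break. How would you split the paragraph to get the correct sentences?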

Tags: python nlp spacy


【Solution 1】:

I used the following, and it works well:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def SentenceSegmentation(Para):
    punkt_param = PunktParameters()
    # Abbreviations that must not end a sentence, written in lowercase and
    # without the trailing period: lat -> latitude, ch -> chapter
    abbreviation = ['lat', 'ch']
    punkt_param.abbrev_types = set(abbreviation)
    # Given a PunktParameters object instead of training text, the tokenizer
    # uses these parameters as they are
    tokenizer = PunktSentenceTokenizer(punkt_param)
    return tokenizer.tokenize(Para)
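
The same idea, a list of abbreviations that must not end a sentence, can also be sketched with only the standard library, which may help when NLTK is not available. This is a minimal sketch and not Punkt's algorithm: `split_sentences` and `ABBREVIATIONS` are hypothetical names, and the boundary rule (a period, whitespace, then an uppercase letter) is a simplifying assumption that will miss other cases such as sentences ending in `?` or `!`.

```python
import re

# Hypothetical abbreviation list (lowercase, no trailing period);
# extend it for your own corpus.
ABBREVIATIONS = {"lat", "ch"}

# Assumed boundary rule: a period, whitespace, then an uppercase letter.
_BOUNDARY = re.compile(r"\.\s+(?=[A-Z])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at candidate boundaries unless the period ends an abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        # The word immediately before the candidate period.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            continue  # the period belongs to an abbreviation, keep going
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = (
        "the apparent position of Mars was 4° 11' 10\" Taurus, "
        "lat. 2° 30' S ^37. "
        "Thus we twice have Mars in the most opportune position."
    )
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

In the sample, `lat.` is followed by a digit, so it never becomes a candidate boundary; the abbreviation set matters for cases such as `ch. Four`, where the next word is capitalized.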

【Comments】:
