【Title】: Force spaCy not to parse punctuation?
【Posted】: 2019-09-15 16:10:46
【Question】:

Is there a way to force spaCy not to parse punctuation as separate tokens?

    nlp = spacy.load('en')
    doc = nlp(u'the $O is in $R')

    [w for w in doc]
    # [the, $, O, is, in, $, R]

What I want:

    # [the, $O, is, in, $R]

【Comments】:

标签: python tokenize spacy punctuation


【Solution 1】:

Customize the prefix_search function for spaCy's Tokenizer class. See the documentation. For example:

import spacy
import re
from spacy.tokenizer import Tokenizer

# adjust the currency regex to match your requirements
# (raw string avoids an invalid-escape warning for \$)
prefix_re = re.compile(r'''^\$[a-zA-Z0-9]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'the $O is in $R')
print([t.text for t in doc])

# ['the', '$O', 'is', 'in', '$R']
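If the set of `$`-prefixed placeholders is small and known in advance, a lighter-weight alternative is the tokenizer's `add_special_case` method, which keeps all of spaCy's default rules and only exempts the listed strings. A minimal sketch (the tickers `$O` and `$R` are just the question's examples; `spacy.blank` is used so no model download is needed):

```python
import spacy
from spacy.symbols import ORTH

# blank English pipeline; swap in a loaded model if you need one
nlp = spacy.blank("en")

# register each placeholder as a special case so the tokenizer
# never splits it, regardless of the default prefix rules
for ticker in ("$O", "$R"):
    nlp.tokenizer.add_special_case(ticker, [{ORTH: ticker}])

doc = nlp("the $O is in $R")
print([t.text for t in doc])
# ['the', '$O', 'is', 'in', '$R']
```

Special cases are checked before the prefix/suffix/infix rules, so this works without touching any regexes.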

【Discussion】:

【Solution 2】:

Yes, there is. For example,

    import spacy
    import re
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\+\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[\(\-\)\@\.\:\$]''')  # you need to change the infix tokenization rules
    simple_url_re = re.compile(r'''^https?://''')

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                         suffix_search=suffix_re.search,
                         infix_finditer=infix_re.finditer,
                         token_match=simple_url_re.match)

    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = custom_tokenizer(nlp)

    doc = nlp(u'the $O is in $R')
    print([w for w in doc])

    # [the, $O, is, in, $R]
    

You just need to add the '$' character to the infix regex (escaped with '\', of course).

Aside: the prefixes and suffixes are included to showcase the flexibility of spaCy's tokenizer. In your case, the infix regex alone would suffice.
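A related variant, if you would rather keep all of spaCy's default rules intact and only stop `$` from being split off as a prefix, is to rebuild the prefix regex from the defaults minus the currency entry. A sketch (assuming the default infix rules do not split on `$`, which holds for the English defaults):

```python
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("en")

# drop only the '$' rule from the default prefixes,
# leaving every other default tokenization rule in place
prefixes = [p for p in nlp.Defaults.prefixes if "$" not in p]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("the $O is in $R")
print([t.text for t in doc])
```

This avoids constructing a `Tokenizer` from scratch, so URLs, contractions, and other default special cases keep working.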

【Discussion】:
